American Business Series 


GENERAL EDITOR 
ROSWELL C. McCREA 


Professor of Economics in Columbia University 


American Business Series

Under the Editorship of
Roswell C. McCrea, Columbia University

Economic History of the United States
By T. W. Van Metre, Columbia University

Business Ownership Organization
By Archibald H. Stockder, Columbia University

Outlines of Accounting
By William S. Krebs, Washington University, St. Louis

Manpower in Industry
By E. S. Cowdrick

Our Competitors and Markets
By A. W. Lahee, Consultant on Foreign Markets

Making Use of a Bank
By J. A. Fitzgerald, Ohio State University

International Trade Finance
By George W. Edwards, New York University

Discount Policy of the Federal Reserve System
By B. H. Beckhart, Columbia University

The Shoe Industry
By Frederick J. Allen, Harvard University

The Transportation Act, 1920
By Rogers MacVeagh, of the New York and Oregon Bars

Advanced Accounting
By William S. Krebs

Our Financial Reformation
By Chester A. Phillips, University of Iowa

Statistical Methods
By Frederick C. Mills, Columbia University

Money and Banking
By William H. Steiner, College of the City of New York

Business Economics
By Ray B. Westerfield, Yale University

Labor Economics
By Solomon Blum, University of California

Theory of Accounts, Volume I
By DR Scott, University of Missouri

Industrial and Commercial Geography, New Edition
By J. Russell Smith, Columbia University


STATISTICAL METHODS 


APPLIED TO ECONOMICS 
AND BUSINESS 


BY 
FREDERICK CECIL MILLS 


ASSOCIATE PROFESSOR OF BUSINESS STATISTICS 
COLUMBIA UNIVERSITY 



NEW YORK 
HENRY HOLT AND COMPANY 


COPYRIGHT, 1924,
BY 


HENRY HOLT AND COMPANY 


August, 1930 


PRINTED IN THE 
UNITED STATES OF AMERICA 


PREFACE 


The last decade has witnessed a remarkable stimulation of
interest in quantitative methods in business and in the social 
sciences. The day when intuition was the chief basis of business 
judgment and unsupported hypothesis the mode in social studies 
seems to have passed. Following the lead of workers in the older 
and traditionally more accurate physical sciences, social scientists 
and serious students of business are employing in greater measure 
than ever before a method of study based upon the observation 
and analysis of facts. When these observations are quantitative 
in character appropriate methods are necessary for their organiza- 
tion and interpretation. This book deals with methods of com- 
bining and analyzing such observations, with primary emphasis 
upon materials drawn from the fields of economics and business. 

The justification for limiting the treatment to these particular 
fields is twofold. Although general statistical methods are prac- 
tically universal in their application, special problems are en- 
countered in every field of study. This is particularly true in the 
realm of economics, which presents many distinctive difficulties 
and many characteristic problems. Methods which are in some 
degree specialized to meet these particular requirements have 
been developed, and these methods call for treatment in a work 
which is restricted in scope. In the second place, methods can 
be most effectively explained in terms of particular subjects; 
abstract methodology is barren of interest to the average person. 
For these reasons the book has been written with reference to 
the specific needs of quantitative workers in economics and 
business. 

In the explanation of methods no attempt has been made to 
secure the brevity of exposition which may be desirable in a 
strictly mathematical work. The purpose throughout has been 
to write for the learner not for the finished master, and the expla- 
nations have been prepared with the needs of the former in mind. 
I have felt free to omit certain detailed demonstrations of theorems
because this book is presented as an introduction to the subject,
not as an exhaustive treatise. 

The methods of quantitative analysis which are in general use 
today represent a long accretion, an accumulation of contribu- 
tions from workers in many fields. It would be vain to attempt 
to enumerate all the individuals who have contributed to the 
development of the science of statistics. Individual references 
are given in particular cases in the body of the text, but no list 
of such acknowledgments can serve as a complete record of the 
debt modern statisticians owe to their predecessors. 

For assistance in the preparation of the material contained in this 
book I am under many obligations. To Mr. H. E. Anderson and 
Mr. H. B. Killough I am indebted for certain of the data employed 
in Chapters XI, XIII and XIV. Professor Warren M. Persons 
of the Harvard Committee on Economic Research has courteously 
permitted me to make use of certain results of his work on com- 
modity price index numbers. The index of building costs used 
in Chapter VII and the index of business conditions presented in 
Chapter IX are products of the Statistical Division of the Ameri- 
can Telephone and Telegraph Company. I have employed them 
with the kind permission of Mr. Seymour L. Andrew, Chief Sta- 
tistician. Suggestions from Professor A. H. Mowbray of the Uni- 
versity of California have enabled me to remove several obscurities 
which were present in an earlier mimeographed edition. I am 
deeply grateful to Professors Henry L. Moore and Theodore H. 
Brown of Columbia University and to Mr. Henry Schultz of the 
Institute of Economics for their help in critically reviewing por- 
tions of the manuscript. Mr. Herbert F. Tutt has rendered 
valuable aid in the preparation of the charts. For assistance at 
every stage of the work involved in the writing of this book I am 
under deep obligation to Mr. Donald H. Davenport. His aid 
in the collection of material, in the preparation of charts and in 
the onerous task of seeing the book through the press has been 
invaluable. To my wife, above all others, I am indebted for a 
measure of constant and generous help which cannot be adequately 
acknowledged here.


F. C. M.
November, 1924.


CONTENTS

CHAPTER
I. STATISTICAL METHODS AND THE PROBLEMS OF ECONOMICS
    AND BUSINESS
II. GRAPHIC PRESENTATION
III. THE ORGANIZATION OF STATISTICAL DATA: THE FREQUENCY
    DISTRIBUTION
IV. DESCRIPTION OF THE FREQUENCY DISTRIBUTION: AVERAGES
V. DESCRIPTION OF THE FREQUENCY DISTRIBUTION: MEASURES
    OF VARIATION AND SKEWNESS
VI. INDEX NUMBERS OF PRICES
VII. THE ANALYSIS OF TIME SERIES: MEASUREMENT OF TREND
VIII. THE ANALYSIS OF TIME SERIES: MEASUREMENT OF
    SEASONAL AND CYCLICAL FLUCTUATIONS
IX. INDEX NUMBERS OF PHYSICAL VOLUME
X. THE MEASUREMENT OF RELATIONSHIP: LINEAR CORRELATION
XI. THE MEASUREMENT OF RELATIONSHIP BETWEEN TIME SERIES
XII. THE MEASUREMENT OF RELATIONSHIP: NON-LINEAR
    CORRELATION
XIII. THE MEASUREMENT OF RELATIONSHIP AND THE PROBLEM
    OF ESTIMATION
XIV. THE MEASUREMENT OF RELATIONSHIP: MULTIPLE AND
    PARTIAL CORRELATION
XV. ELEMENTARY PROBABILITIES AND THE NORMAL CURVE
    OF ERROR
XVI. STATISTICAL INDUCTION AND THE PROBLEM OF SAMPLING
APPENDIX A. THE METHOD OF LEAST SQUARES AS APPLIED
    TO CERTAIN STATISTICAL PROBLEMS
APPENDIX B. GLOSSARY OF SYMBOLS
LIST OF REFERENCES
INDEX



LIST OF CHARTS

FIGURE
1. Location of a Point with Reference to Rectangular
    Coördinates
2. Cotton Prices in the United States, by Months, During
    the Year 1923
3. Graph of the Equation y = x
4. Graph of the Equation y = 3x + 2
5. Parabola: Graph of the Equation y = x²
6. Equilateral Hyperbola: Graph of the Equation [illegible]
7. Exponential Curve: Graph of the Equation y = 2^x
8. Sine Curve: Graph of the Equation y = sin x
9. Graph of the Equation log y = 2 log x
10. Graph of the Equation y = x². (Plotted on paper with
    logarithmic scales)
11. The Compound Interest Law: The Growth of $10.00
    at Compound Interest at 6% for 100 Years. (Plotted
    on the arithmetic scale)
12. The Compound Interest Law: The Growth of $10.00
    at Compound Interest at 6% for 100 Years. (Plotted
    on the semi-logarithmic or ratio scale)
13. Wheat Flour Exports from the United States, 1901-1923
14. Production of Steel in the United States, 1896-1922.
    (Plotted on the semi-logarithmic scale)
15. Sales of the Acme Corporation, 1910-1923. Showing the
    Total Sales in the United States and the Sales in
    Certain Subdivisions. (Plotted on the arithmetic scale)
16. Sales of the Acme Corporation, 1910-1923. Showing the
    Total Sales in the United States and the Sales in
    Certain Subdivisions. With Scales of Increase, Decrease
    and Comparison. (Plotted on semi-logarithmic paper)
17. Farms in New England States in 1920
18. Elements of Production Costs of the XYZ Corporation,
    by Months, 1923
19. Comparison of Scheduled and Actual Output (Cumulative)
    Speedwell Automobile Company, 1924
20. Comparison of Scheduled and Actual Output: Gantt
    Progress Chart. (Showing the situation on [date
    illegible])
21. Comparison of Scheduled and Actual Output: Gantt
    Progress Chart. (Showing the situation on September
    30th)
22. Column Diagram: Distribution of 210 Employees Classified
    on the Basis of Weekly Earnings. (Class-interval = $2.00)
23. Column Diagram: Distribution of 210 Employees Classified
    on the Basis of Weekly Earnings. (Class-interval = $1.00)
24. Column Diagram: Distribution of 210 Employees Classified
    on the Basis of Weekly Earnings. (Class-interval = $.50)
25. Column Diagram: Distribution of 210 Employees Classified
    on the Basis of Weekly Earnings. (Class-interval = $.25)
26. Frequency Polygon: Distribution of 210 Employees
    Classified on the Basis of Weekly Earnings.
    (Class-interval = $2.00)
27. Frequency Polygon: Distribution of 210 Employees
    Classified on the Basis of Weekly Earnings.
    (Class-interval = $1.00)
28. Frequency Polygon: Distribution of 210 Employees
    Classified on the Basis of Weekly Earnings.
    (Class-interval = $.50)
29. Column Diagram: Distribution of Personal Income
    Recipients in the United States, 1918. Including all
    Recipients of Incomes below $4,000. (Class-interval
    = $500)
30. Column Diagram: Distribution of Personal Income
    Recipients in the United States, 1918. Including all
    Recipients of Incomes below $4,000. (Class-interval
    = $200)
31. Column Diagram: Distribution of Personal Income
    Recipients in the United States, 1918. Including all
    Recipients of Incomes below $4,000. (Class-interval
    = $100)
32. Frequency Curve: Distribution of Personal Income
    Recipients in the United States, 1918. Including all
    Recipients of Incomes below $4,000. (Derived from the
    column diagram with class-interval of $100)
33. Cumulative Frequency Curve: Distribution of Telephone
    Poles Classified according to Length of Life.
    (Cumulated upward)
34. Cumulative Frequency Curve: Distribution of Telephone
    Poles Classified according to Length of Life.
    (Cumulated downward)
35. Distribution of Sawmills in the United States Classified
    according to Labor Cost in 1920. Illustrating the
    Structural Relation between the Ogive and the
    Frequency Curve
36. Frequency Curve: Distribution of 18,780 Soldiers
    Classified according to Height
37. Frequency Curve: Distribution of Errors of Observation
38. Zone of Dispersion, Artillery Firing, Showing the
    Theoretical Percentage Distribution of Shots
39. Column Diagram: Distribution of 1,000 Shots from a
    Single Gun
40. Frequency Polygon: Distribution of Heads in a
    Coin-Tossing Experiment
41. Frequency Polygon: Distribution of 5,540 Cases of
    Change in the Wholesale Prices of Commodities from
    One Year to the Next
42. Frequency Polygon: Distribution of London-New York
    Exchange Rates [remainder illegible]
43. The Normal Curve of Error
44. Illustrating the Location of the Median with Ungrouped
    Data (Personal Incomes of Seven Individuals)
45. Distribution of Weekly Earnings of Employees. A
    Smoothed Frequency Curve, Showing the Relation
    between Mean, Median and Mode
46. Cumulative Distribution of Weekly Earnings of
    Employees, Illustrating the Graphic Location of Median
    [remainder illegible]
47. Frequency Polygon: Distribution of Relative Prices of
    346 Commodities in 1914. (Average prices in 1913
    = 100)
48. Frequency Polygon: Distribution of Relative Prices of
    222 Commodities in 1900. (Average prices in 1890
    = 100)
49. Frequency Polygon: Distribution of Relative Prices of
    1,437 Commodities in 1918. (Average prices July,
    1913 to June, 1914 = 100)
50. Frequency Polygon: Distribution of Relative Prices of
    1,437 Commodities in 1918, with Relative Prices
    Plotted on Logarithmic Scale. (Average prices July,
    1913 to June, 1914 = 100)
51. Comparison of Five Simple Index Numbers of Farm
    Crop Prices, 1910-1923. (1910 = 100)
52. Comparison of Four Weighted Index Numbers of Farm
    Crop Prices, 1910-1923. (1910 = 100)
53. A Graphic Representation of the Relation between
    Certain Elements of the Price System
54. New York Bank Clearings, 1860-1923, with Moving
    Averages
55. Illustrating the Fitting of a Straight Line to Nine Points
56. Illustrating the Fitting of a Second Degree Parabola to
    Nine Points
57. Business Failures in the United States, 1897-1921, with
    Three Lines of Trend
58. Production of Crude Petroleum in the United States,
    1908-1922, with Line of Trend Fitted to Logarithms
59. Production of Crude Petroleum in the United States,
    1908-1922, with Line of Trend Fitted to Natural
    Numbers
60. Comparison of Actual and Deflated Values of Building
    Contracts Awarded, by Months, 1914-1923
61. Frequency Distributions of Ratios of Monthly Egg
    Prices to the Corresponding Ordinates of Secular
    Trend
62. Illustrating the Method of Determining Monthly Trend
    Values
63. Showing the Relation between Trend Values of Total
    Annual Production, Average Monthly Production and
    Production by Months
64. Production of Bituminous Coal in the United States,
    1913-1923 (by months)
65. Cyclical and Accidental Fluctuations in Bituminous
    Coal Production in the United States, 1913-1923
66. Composite Index of General Business Activity
67. Scatter Diagram showing the Relation between Taxable
    Personal Incomes and Motor Vehicle Registration, by
    States, in 1921, with Line of Average Relationship
68. Tabulation of Items in a Correlation Table
69. Scatter Diagram of Federal Reserve and Commercial
    Bank Rates, with Line of Average Relationship and
    Zone of Estimate [parenthetical illegible]
70. Showing the Relation between Discount Rates of
    Commercial Banks and Federal Reserve Bank Discount
    Rates. (Commercial bank rates dependent)
71. Showing the Relation between Federal Reserve Bank
    Discount Rates and the Discount Rates of Commercial
    Banks. (Federal Reserve bank rates dependent)
72. Showing the Relation between Number of Wage Earners
    in Factories and Value of Products in Ten Selected
    Cities in the State of New York
73. Showing the Relation between Number of Wage Earners
    in Factories and Value of Products in Eleven Selected
    Cities in the State of New York
74. Cotton Production in the United States, Crop Years
    1900-1901 to 1922-1923, with Lines of Trend
75. Prices of Middling Upland Cotton in New York, Crop
    Years 1900-1901 to 1922-1923, with Lines of Trend
76. Comparison of the Cyclical Fluctuations in Industrial
    Stock Prices and in General Business Activity,
    1903-1923
77. Coefficients of Correlation between Index of Industrial
    Stock Prices and Index of Business Activity, 1903-1914,
    showing the Results Secured with Different Pairings
78. Coefficients of Correlation between Index of Industrial
    Stock Prices and Index of Business Activity, 1919-1923,
    showing the Results Secured with Different Pairings
79. Scatter Diagram showing the Relation between Alfalfa
    Yield and Irrigation Water Applied, with Two Lines
    of Regression
80. Scatter Diagram showing the Relation between Wheat
    Yield and Nitrogen Applied as Fertilizer, with Straight
    Line of Regression and Line Joining the Means of the
    Columns
81. The Relation between the Production and Price of Oats;
    Illustrating the Use of an Arithmetic Equation of
    Regression and Arithmetic Zones of Estimate
82. The Relation between the Production and Price of Oats;
    Illustrating the Use of a Logarithmic Equation of
    Regression and Geometric Zones of Estimate
83. The Relation between the Production and Price of Oats;
    Illustrating the Use of a Logarithmic Equation of
    Regression and Geometric Zones of Estimate. (Plotted
    on double logarithmic paper)
84. The Relation between the Production and Price of Oats;
    Illustrating the Use of an Equation of Regression
    Based upon Reciprocals and of Harmonic Zones of
    Estimate
85. A Comparison of Actual and Theoretical Frequencies in
    a Dice-rolling Experiment
86. Illustrating the Fitting of a Normal Curve to Frequency
    Distribution of Telephone Subscribers, Classified
    According to Message Use
87. An Illustration of the Measurement of Areas under the
    Normal Curve


STATISTICAL METHODS 


CHAPTER I 
STATISTICAL METHODS AND THE PROBLEMS 
OF ECONOMICS AND BUSINESS 


The term business as commonly used today covers 
many kinds of human activity. Since we are concerned 
with the development of methods of handling economic 
data and solving business problems it is appropriate that 
the particular types of business activity in connection with 
which statistical methods may be employed be briefly 
described. 


CLASSES OF BUSINESS ACTIVITY


The tasks which confront business men may, without 
undue straining, be placed in three classes. Many business 
men are concerned with but one type of activity, it is true, 
yet the world of business must deal with problems arising 
in all three of the fields to be outlined. First, in logical 
sequence, are the technical tasks which arise in the processes 
of production, involving problems of chemistry and physics, 
of engineering, of animal husbandry, of navigation. The 
basic technical knowledge called for in the solution of these 
problems furnishes the foundation of our economic life. 
This is the domain of the hard-won arts of handling the 
raw materials and controlling the forces of nature. 

In the second class come those activities which are con- 
nected with the internal organization and administration of 
individual business units. The technical functions of ma- 
nipulating organic and inorganic matter for the satisfaction 
of human wants are performed through administrative 
units, single farms, mines, factories, railroads, department
stores. A whole new division of problems is faced by the
business man in organizing these units, in coördinating the
work of the different departments, in supervising the daily 
activities of the individuals making up each organization. 
While these are perhaps less fundamental than the tech- 
nical problems of production, they are, for the average 
business man, more pressing and more difficult. Scientific 
method has made less progress in solving these latter 
problems. There is not the organized body of knowledge 
which is found in the former field, nor are there the same 
trained experts to whom the tasks may be delegated. 

The two types of economic activity which have been 
named above include tasks which are in a sense self- 
centered and controllable. The manufacturer of steel has 
his technical problems of smelting and refining, his particu- 
lar administrative duties. The farmer or mine-owner faces 
the same types of problems, in forms peculiar to his own 
situation. In the performance of tasks in these fields each 
man is dealing with problems all the elements of which are 
under more or less perfect control. Difficulties arise, but 
these are ordinarily difficulties inherent in the given task, not 
difficulties arising from a sudden change in the constituent 
elements of the problem, or the sudden interjection of a 
new factor. In this respect the third category of tasks to 
be performed by the business man differs materially from 
the first two. For this class is composed of problems the 
most characteristic feature of which is that the elements 
are not subject to control by the individuals directly 
concerned. 

This third division includes buying and selling, and all 
the attendant activities which are carried on in terms of 
prices. As economic life is at present organized these 
functions are, to the business man, the most important 
ones he performs. The technical tasks of production and 
of internal organization and administration are but means 
to an end. For the business man the goal of economic 
activity is the disposal of his product at a profit. The
tasks preliminary to this final sale are of necessity sub-
ordinated to it, and so performed that the final aim may
be achieved. The point of emphasis here is that the busi-
ness man, in buying and selling, faces problems containing
elements which he cannot control. In securing his raw 
material, in bringing together the other agents needed in 
production, and in the final disposal of his product, the 
business man deals with markets — commodity markets, 
labor markets, money markets — and finds himself acting 
in relation to a system of prices quite beyond his control 
in its major movements. The other less fundamental 
phases of his activity are subject to a high degree of con- 
trol, but when the business man comes to the final and 
most important act, the profitable sale of his product, his 
power of control dwindles. The motivating force in busi- 
ness activity is the hope of pecuniary profits, pecuniary 
profits depend upon successful buying and selling, success- 
ful buying and selling depend upon favorable conditions in 
an uncontrollable world of prices — here is the argument 
which states the major problem of business. And these 
are the facts which make the price system the dominating 
and all-important factor in modern business life. 

The modern entrepreneur lives in an environment of 
prices. The term "environment" is not an unapt figure;
this world of prices in which the business man functions 
constitutes a coherent, consistent, well-articulated system 
of interdependent parts, a system which encompasses all 
the business activities of the entrepreneur. Since the 
system is beyond the control of the individual he must 
adapt himself to it, and must base his activities upon as 
complete an understanding of the system as he may ob- 
tain. Without this understanding the major problems of 
business are incapable of solution. 


QUANTITATIVE CHARACTER OF ECONOMIC AND BUSINESS 
PROBLEMS 


Problems falling in the first of the classes outlined above 
have long been recognized as essentially quantitative in 
character. Their solution calls for the application of the 
methods of precision which have been developed in the 
physical sciences. It is no less true that the strictly eco- 
nomic and business problems which fall in the other classes 
require the employment of quantitative methods. Quali- 
tative, non-measurable considerations may enter in the 
solution of such problems, but these factors generally rest 
upon a quantitative basis. Facts, measured, weighed and 
compared with other facts, constitute the basis of business 
judgments and the foundation of economic reasoning. 
Statistical methods are merely methods of measuring, 
weighing and comparing facts. 

Of the three classes of problems distinguished in the 
preceding section two come within the scope of the present 
discussion. Though the methods of statistics are in part
applicable to the solution of technical problems of pro- 
duction, it is not the purpose of the present work to de- 
velop this subject. For the solution of problems in the two 
other fields — those connected with the internal organi- 
zation and administration of business units and with the 
processes of buying and selling which bring the business 
man into contact with the price system — methods of 
statistical analysis are peculiarly appropriate. 


STATISTICAL METHODS AND PROBLEMS OF INTERNAL 
ADMINISTRATION 


The typical business man, in the administration of his 
organization, is called upon to deal with masses of quanti-
tative data, quantitative in the sense that they are based
upon measurements in terms of specific units. He is deal- 
ing with tons of coal, cubic feet of gas, or kilowatt hours of 
energy consumed; with tons of pig iron or pairs of shoes 
produced; with machine hours and man hours; with 
wages, costs of production and selling prices expressed in 
dollars and cents. With the increasing size of the business 
unit the data with which the administrator must deal be- 
come increasingly complicated and numerous, and it be- 
comes increasingly difficult to determine their true sig- 
nificance. Under intuitive or rule-of-thumb methods of 
administration it is impossible effectively to analyze large 
masses of data and to control business units above the 
average in size. It has been abundantly demonstrated that 
the law of decreasing returns comes into play in business 
largely because of administrative difficulties. 

Whenever one deals with masses of data the problem is 
one of condensation and analysis — condensation and sim- 
plification in order that it may be possible for limited 
human faculties to handle the data, analysis (and com- 
parison) in order that the elements in the problem may be 
distinguished and their significance appreciated. Statisti- 
cal methods have been developed to facilitate the con- 
densation and analysis of masses of quantitative data. 
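
The idea of condensation can be illustrated with a minimal sketch, offered here as a modern aside and not as part of the original text. The earnings figures, the function name `condense` and the class width are all hypothetical; the sketch merely reduces a list of raw observations to a count, an average, a median and a table of class frequencies.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical weekly earnings in dollars (illustrative only, not from the text).
weekly_earnings = [18.50, 21.00, 22.75, 19.25, 24.00,
                   21.50, 20.00, 23.25, 22.00, 26.50]


def condense(values, class_width):
    """Condense raw observations into a count, a mean, a median and a
    frequency table keyed by the lower limit of each class."""
    classes = Counter(int(v // class_width) * class_width for v in values)
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "frequencies": dict(sorted(classes.items())),
    }


summary = condense(weekly_earnings, class_width=2)
print(summary)
```

Ten separate figures are thus replaced by four summary items, which is the sense in which a mass of data is brought within the reach of limited human faculties; the choice of a two-dollar class width is arbitrary and would be varied in practice.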

As a typical example of such a problem may be men- 
tioned the allocation of costs, an operation which has been 
called cost accounting. The proper analysis of all the 
factors which enter into this problem is only possible 
through the use of statistical methods. Accounting meth- 
ods, necessarily restricted to the treatment of pecuniary 
units, are inadequate for the complete analysis of the items 
of expense. Quite apart, too, from the knowledge to be 
derived from an allocation of costs, expense items should 
be analyzed statistically, variations noted, comparisons 
made, and wastes discovered. The analysis of sales records, 
again, calls for the condensation of masses of data, their 
representation in simple, understandable form, and the
determination of the significance of the data. The analysis 
of markets, the study of purchasing records and of com- 
modities, require the use of quantitative methods not 
restricted in their application to any one class of measure- 
ments. At every hand in internal administration statisti- 
cal methods may be used to supplement accounting meth- 
ods, to extend the knowledge of the executive, and to make 
more effective the control of business operations. 


STATISTICAL METHODS AND EXTERNAL PROBLEMS 


New problems are encountered when the business man 
goes into the market to buy or sell. The price system, the 
movements of which are of such fundamental interest to 
the business man, requires analysis through the use of 
quantitative methods. So complex and numerous are the 
data to be dealt with here that simplification is imperative. 
Again, he is faced with the phenomena of the business 
cycle, and if he is to adapt his business policies to the 
swings of the cycle he must undertake the analysis of these 
phenomena, employing tools appropriate to the task. 
Apart somewhat from the immediate interests of the busi- 
ness man, but of great interest to the economist, are all the 
problems connected with the economic process of distri- 
bution, the allocation of income and wealth among the 
agents of production. These, as well as that other great 
economic problem concerned with the question of value or 
price determination, are quantitative problems, to be 
solved through the use of quantitative methods of research. 

What are these methods, and wherein does research em- 
ploying such methods differ from other types of research? 
Scientific inquiry, whatever its particular method may be, 
proceeds through careful observation, logical inference and 
accurate verification. Quantitative methods differ from
others only in that observation, inference and verification 
are based upon measurement, and are therefore more likely
to be exact and accurate than are non-quantitative methods 
of analysis. Until measurement is possible in a science it 
is unavoidable that its observations and findings should 
lack precision, no matter how brilliant the flashes of intui- 
tion nor how painstaking the labors of its students may 
be. The employment of methods of measurement, making 
possible the analysis of the factors involved in terms of 
precise units, gives to a science all the advantages which 
sharp-edged tools have over blunt and unreliable instru- 
ments. Mathematics and its offspring, statistics and ac- 
counting, are the powerful instruments which the modern 
economist has at his disposal, and of which business, 
through the development of research agencies and meth- 
ods, is making constantly greater use. 

The tools of the statistician are merely certain mathe- 
matical methods, developed for particular types of research. 
These types of research were not economic in the original 
development of statistical methods, but social, political and 
anthropometric, with one line of development (that relat- 
ing to the theory of probabilities) extending back through 
the field of logic to the gaming table. Yet these tools, 
developed for work in restricted spheres, have been found 
to possess much wider applicability, and economics has 
been one of the newer fields in which the application of 
these methods, with appropriate alterations and additions, 
has had fruitful results. The economist has found his hand 
strengthened and the precision of his work materially in- 
creased by the new tools. And business, together with the 
more abstruse science of economics, has profited. 

It is perhaps unnecessary at this point to stress the 
limitations of the statistical method. For limitations it 
has, and limitations not always kept in view by those 
employing it. A discussion of this subject may be deferred 
to a later point. The fact may be emphasized at this stage, 
however, that the statistical method, as a tool, requires in- 
telligent usage, and that the results secured through statisti-
cal analysis require intelligent interpretation. Man has
not yet invented a machine so automatically perfect that 
jumbled facts may be fed in at one end, while answers to 
problems flow out at the other. The faculties of intelligent 
reasoning and critical judgment have not become out-worn 
appendages because of the development of statistical 
methods of research. 


CHAPTER II 
GRAPHIC PRESENTATION 


The explanation of methods of condensing, analyzing 
and interpreting the facts of business and economics must 
start with the discussion of some fundamental considera- 
tions which are mathematical rather than statistical in 
character. In doing so it is deemed advisable, even at the 
risk of treading quite familiar ground, to explain certain 
mathematical conceptions to which constant reference will 
be made in later chapters. 

Statistical analysis is concerned primarily with data 
based upon measurement, expressed either in pecuniary or 
physical units. The methods of coordinate geometry, de-
veloped first by the philosopher Descartes, greatly facili-
tate the manipulation and interpretation of such data.
A summary of the basic principles of coordinate geometry
will not be out of place. 


RECTANGULAR COORDINATES 


If two straight lines intersecting each other at right 
angles are drawn in a plane, it is possible to describe the 
location of any point in that plane with reference to the 
point of intersection of the two lines. We will call one of 
the lines (a vertical line) Y’Y, the other line (horizontal) 
X’X, and the point of intersection (or origin) O (cf. Fig. 1).
If P be any point in the plane, we may draw the line PM, 
parallel to Y’Y and intersecting X’X at M, and the line 
PN, parallel to X’X and intersecting Y’Y at N. If we set 
OM equal to a units and ON equal to b units, a and b con-
stitute the coordinates of P, describing its location with
respect to the origin, O. Thus, in the figure, a equals 6 and
b equals 5. The distance a along the x-axis is termed the 
abscissa of the point P, while the distance b along the y-axis 
is termed the ordinate of the point P. (It is a rule of nota- 
tion always to give the abscissa first, followed by the 
ordinate.) The coordinates of any other point in the same
plane may be determined in the same way. Conversely,
any two real numbers determine a point in the plane, if
one be taken as the abscissa and the other as the ordinate.

Fig. 1. Location of a Point with Reference to Rectangular Coordinates

A point may lie either to the right or left or above or 
below the origin, O. It is conventional to designate as 
positive abscissas laid off to the right of the origin, and as 
negative abscissas laid off to the left of the origin, while
ordinates are positive when laid off above the origin and
negative when laid off below the origin. In general, the 
values to be dealt with in economic statistics lie in the 
upper right-hand quadrant, where both abscissa and ordinate 
are positive. 

This conception of coordinates is fundamental in mathe-
matics and of basic importance in statistical work. A very
simple example will illustrate the utility of this device in
representing business data. The figures presented in the
following table may be employed.


TABLE 1
Price of Cotton, by Months, During the Year 1923

(Average prices received by producers in the United States)

Month            Price per pound
January          24.5 cents
February         25.9   "
March            [illegible]
April            [illegible]
May              [illegible]
June             [illegible]
July             26.2   "
August           [illegible]
September        [illegible]
October          [illegible]
November         [illegible]
December         31.0   "


(From Weather, Crops and Markets, Dec. 29, 1923.) 


These data may be represented graphically on the co-
ordinate system, months being laid off along the x-axis and
prices along the y-axis, as in the accompanying diagram
(Fig. 2). In plotting the abscissas, January, 1923, is con-
sidered as located at the point of origin. The x-value of
the entry for January, 1923, is thus 0, of the February
figure 1, etc. The coordinates of the point representing
the price of cotton in January, 1923, are 0, 24.5; for
February the values are 1, 25.9. The coordinates for
December are 11, 31.0. The trend of cotton prices during
the year may be more easily followed if the points are
connected by a series of straight lines, as is done in the 
figure. 
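
The plotting rule just described may be sketched in Python. The sketch is illustrative only: the name `to_coordinates` is an assumption introduced here, and only the three prices quoted in the text (January, February and December) are used, with January placed at x = 0.

```python
# Illustrative sketch: turn a monthly series into (abscissa, ordinate) pairs,
# with January located at the origin of the x-axis (x = 0).
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]


def to_coordinates(prices_by_month):
    """Return (x, y) pairs; x is the month's position counted from January = 0."""
    return [(MONTHS.index(month), price)
            for month, price in prices_by_month.items()]


# Only the values quoted in the text are used here (cents per pound).
quoted = {"January": 24.5, "February": 25.9, "December": 31.0}
points = to_coordinates(quoted)
print(points)  # [(0, 24.5), (1, 25.9), (11, 31.0)]
```

The abscissa is always listed first and the ordinate second, following the rule of notation given above.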


INDEPENDENT AND DEPENDENT VARIABLES 


In the location of any point by means of coordinates it
has been pointed out that two values are involved; every
point ties together and expresses a relation between two
factors. In the above case these are months and cotton
prices. With the passage of time the price of cotton changes,
and the broken line shows the direction and magnitude of 
these changes. Both time and price are variables, that is, 
they are quantities not of constant value but characterized 
by variations in value in the given discussion. Thus in 
Fig. 1 the abscissa has a fixed value of 6, while the ordinate 
has a fixed value of 5, but in Fig. 2 both abscissa and ordi- 
nate have varying values, the one varying from 0 to 11, 
the other from 23.5 to 31.0. The symbols x and y are, 
by convention, used to designate such variable quantities 
as these, the former in all cases representing the variable 
plotted along the horizontal axis, the latter representing 
the variable plotted along the vertical axis. 

In Fig. 2, which depicts the changes taking place in 
cotton prices with the passage of time, it will be noted 
that the latter variable changes each time by an arbitrary 
unit, one month. Having made an independent change in 
the time factor we then determine the change in price 
taking place during the period thus arbitrarily chopped
out. The variable which increases or decreases by incre-
ments arbitrarily determined is called the independent
variable, and is always plotted on the x-axis. The other
variable is termed the dependent variable, and is plotted on
the y-axis. This dependence may be real, in the sense that
the values of the second variable are definitely determined
by the values of the independent variable, or it may be
purely the conventional dependence of the type described.
Time, it should be noted, is always plotted as independent,
when it constitutes one of the variables.

Fig. 2. Cotton Prices in the United States, by Months, During the Year 1923
(Monthly averages of prices received by producers)
[Vertical scale: cents per pound]

1 It should be noted that letters at the end of the alphabet are used as symbols
for variables, while letters at the beginning of the alphabet are used as symbols for
constants, i.e. quantities the values of which do not change in the given discussion.


FUNCTIONAL RELATIONSHIP


When the relationship between two variables is one of 
complete dependence, so that the value of y is uniquely 
determined by a given value of x, y is said to be a function
of x. The general expression for such a relationship is
y = f(x). Thus the speed at which a body is falling at a
given moment is a function of the time it has been falling, 
the pressure of a given volume of gas is a function of its 
temperature, the increase of a given principal sum of 
money at a fixed rate of interest is a function of time. If 
the values of the independent variable be laid off on the
x-axis of a rectilinear chart and the corresponding values
of the function (i.e. the dependent variable) be laid off on
the y-axis, a graphic representation of the function will be
secured, in the form of a curve.¹ This concept of functional
relationship is a very important one in statistical work. 
Some of the simpler functions may be briefly discussed. 


THE STRAIGHT LINE


If two variables are so related that their values are 
always the same, their relationship is obviously of the 
form y = x. As a very simple example, the relation be-
tween the age of a tree and the number of rings in its trunk
may be considered. A tree 20 years old will have 20 rings, 
and so on. This relationship may be represented on a co- 
ordinate chart, several sample values of x and y being 
taken. When these points are plotted and a line drawn 
through them, we secure a straight line passing through 
the origin and (assuming the two scales to be equal) bi- 
secting the right angle XOY (cf. Fig. 3). 

Similarly, any equation of the first degree (i.e. not
involving xy, or powers of x or y above the first) may be
represented by a straight line. The generalized equation
can be reduced to the form y = a + bx, where a is a con-
stant representing the distance from the origin to the
point of intersection of the given line and the y-axis, and
b is a constant representing the slope of the given line
(that is, the tangent of the angle which the line makes
with the horizontal). The constant term a is called the
y-intercept. It is clear from the generalized equation of the
straight line that when x has a value of zero, y will be
equal to this constant term. In the example given above
(Fig. 3) a is equal to 0, and b to 1. The location of a given
line depends upon the signs of a and b as well as upon
their magnitudes. The practical problem involved in the
determination of any straight line is that of finding the
values of a and b from the data, a problem which will
appear in various forms in the discussion of statistical
methods.

Fig. 3. Graph of the Equation y = x

1 The general term “curve” is used to designate any line, straight or curved,
when located with reference to a coordinate system.

These points may be illustrated by the plotting of a 
simple equation of the first degree. Thus, to construct 
the graph of the function y = 2 + 3x, various values of
x are assumed, and corresponding values of y are de-
termined. These may be arranged in the form of a
table: 


   x        y
        (2 + 3x)

  -4      -10
  -2       -4
   0        2
   2        8
   4       14


Plotting these values and connecting the plotted points, 
the graph illustrated in Fig. 4 is secured. It will be noted 
that since this function is linear (that is, the graph takes 
the form of a straight line) any two of the points would 
have been sufficient to locate the line. The y-intercept is
equal to the constant term 2, and the tangent of the angle 
which the given line makes with the horizontal (the slope 
of the line) is equal to 3, the coefficient of x. That this 
curve represents the equation is proved by the fact that 
the equation is satisfied by the codrdinates of every point 
on the curve, and that every pair of values satisfying the 
equation is represented by a point on the curve. It is 
characteristic of a linear relationship that if one variable 
be increased by a constant amount, the corresponding 
increment of the other variable will be constant. In the 
above case as x grows by constant increments of two, for 
example, the constant increment of the y-variable is six. 
Series which increase in this way by constant increments 
are termed arithmetic series.
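
The tabulation above may be checked with a few lines of Python. This is a minimal sketch assuming nothing beyond the equation y = 2 + 3x; the function name `line` is hypothetical. It confirms that equal steps in x yield equal steps in y.

```python
def line(x, a=2, b=3):
    """Value of y = a + b*x; a is the y-intercept, b the slope."""
    return a + b * x


xs = [-4, -2, 0, 2, 4]            # x grows by constant increments of two
ys = [line(x) for x in xs]
increments = [later - earlier for earlier, later in zip(ys, ys[1:])]

print(ys)          # [-10, -4, 2, 8, 14]
print(increments)  # [6, 6, 6, 6]
print(line(0))     # 2, the y-intercept
```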

Many examples of linear relationship between variables 
are found in the physical sciences. An example from the 
economic world is found in the growth of money at simple
interest, that is, interest which is not compounded. If we
let r represent the rate of simple interest, x the number of
years, and y the sum to which one dollar will amount at
the end of x years, the equation of relationship is of the
form

    y = 1 + rx


Since in a given case r will be constant, this is of the simple 
linear type. In statistical work precise relationships of
this type rarely if ever occur, but approximations to the
straight line relationship are found constantly.

Fig. 4. Graph of the Equation y = 3x + 2
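
The simple interest relation y = 1 + rx may be illustrated in the same way. The rate of 6 per cent used here is a hypothetical value chosen for the example, and the function name is likewise an assumption; exact fractions are used so that the constant yearly increment is not blurred by rounding.

```python
from fractions import Fraction


def simple_interest_amount(rate, years):
    """Sum to which one dollar amounts at simple interest: y = 1 + r*x."""
    return 1 + rate * years


rate = Fraction(6, 100)  # hypothetical rate of 6 per cent
amounts = [simple_interest_amount(rate, x) for x in range(5)]
steps = {later - earlier for earlier, later in zip(amounts, amounts[1:])}

print([float(amount) for amount in amounts])  # [1.0, 1.06, 1.12, 1.18, 1.24]
print(steps == {rate})                        # True: every yearly increment equals r
```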


NON-LINEAR RELATIONSHIP


Non-linear functions are of many types, of which only 
a few of the more common will be discussed here. The 
student should be familiar with the general characteristics 
of the chief non-periodic curves, of which the parabolic 
and hyperbolic types, on the one hand, and the exponential 
type on the other, are the most important. The potential 
series is mentioned as a more general form of rather wide 
utility. Of periodic functions the sine curve is briefly 
described, as a fundamental form. 

Functional relationships of the parabolic or hyperbolic 
form are quite common in the physical sciences, and such 
curves are found to fit certain classes of economic data. 
The general equation, when there is no constant term, is 
of the form y = ax^b. The curve is parabolic when the
exponent b is positive, and hyperbolic when b is negative. 
The two following examples will serve to illustrate these 
types: 


Problem: To construct the graph of the function y = x^2.

   x      y
        (x^2)
  -5     25
  -4     16
  -3      9
  -2      4
  -1      1
   0      0
   1      1
   2      4
   3      9
   4     16
   5     25


The graph is shown in Fig. 5.

Fig. 5. Parabola: Graph of the Equation y = x^2

Problem: To construct the graph of the function y = x^{-1},
for positive values of x.

   x      y
       (x^{-1})
[tabulated values illegible in source]


The graph of this function, an equilateral hyperbola, is
shown in Fig. 6. It should be noted that this equation
may also be written y = 1/x, or xy = 1.

Fig. 6. Equilateral Hyperbola: Graph of the Equation y = x^{-1}
(for positive values of x)


It is characteristic of relationships of this type that as
x increases in geometric progression, y also changes in
geometric progression, increasing when b is positive and
decreasing when b is negative. Thus, in the example of the
parabola given above (y = x^2), if we select the x values
which form a geometric series,¹ the corresponding y
values form a similar series:

   x    1    2    4    8    16    32
   y    1    4   16   64   256  1024
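
This property may be verified numerically. The following Python sketch is illustrative only; the function name `power_curve` and its default constants are assumptions. It takes x values in geometric progression and shows that the y values of both the parabola y = x^2 and the hyperbola y = x^{-1} have a constant ratio between successive terms.

```python
def power_curve(x, a=1, b=2):
    """Value of y = a * x**b (parabolic when b > 0, hyperbolic when b < 0)."""
    return a * x ** b


xs = [1, 2, 4, 8, 16, 32]  # geometric series with constant multiplier 2

parabola = [power_curve(x) for x in xs]
parabola_ratios = [later // earlier for earlier, later in zip(parabola, parabola[1:])]

hyperbola = [power_curve(x, b=-1) for x in xs]
hyperbola_ratios = [later / earlier for earlier, later in zip(hyperbola, hyperbola[1:])]

print(parabola)          # [1, 4, 16, 64, 256, 1024]
print(parabola_ratios)   # [4, 4, 4, 4, 4]
print(hyperbola_ratios)  # [0.5, 0.5, 0.5, 0.5, 0.5]
```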


Another class of functions is of the form represented by 
the equation y = ab^x. In equations of this type one of the
variable quantities occurs as an exponent; graphs repre- 
senting such equations are called exponential curves. The 
example which follows illustrates the type. 

Problem: To construct the graph of the function y = 2^x,
for positive values of x.

   x      y
        (2^x)
[tabulated values illegible in source]


This graph is shown in Fig. 7. 

It has been noted that the relationship between two 
variables which increase by constant increments (consti- 
tuting arithmetic series) may be represented by a straight 
line, and that the relationship between variables increasing 
in geometric progression may be represented by either a 
parabola or a hyperbola. The exponential curve constitutes 
a hybrid type. It describes a relation in which one variable 
increases in arithmetic progression while the other in-
creases in geometric progression. The figures given above
illustrate this relationship.

Fig. 7. Exponential Curve: Graph of the Equation y = 2^x
(for positive values of x)

1 A geometric series is one each term of which is derived from the preceding
term by the application of a constant multiplier.
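
The hybrid character of the exponential curve may be shown with a short Python sketch of y = 2^x. The x values 0 to 5 are assumed sample values, chosen for illustration rather than taken from the printed table, and the function name is hypothetical.

```python
def exponential_curve(x, a=1, b=2):
    """Value of y = a * b**x."""
    return a * b ** x


xs = [0, 1, 2, 3, 4, 5]  # assumed sample values, in arithmetic progression
ys = [exponential_curve(x) for x in xs]

x_steps = [later - earlier for earlier, later in zip(xs, xs[1:])]
y_ratios = [later // earlier for earlier, later in zip(ys, ys[1:])]

print(ys)        # [1, 2, 4, 8, 16, 32]
print(x_steps)   # [1, 1, 1, 1, 1]  constant increment: arithmetic series
print(y_ratios)  # [2, 2, 2, 2, 2]  constant multiplier: geometric series
```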

Curves based upon relationships of the following type 
have been employed extensively in statistical inquiries: 

    y = a + bx + cx^2 + dx^3 + ...

The term potential series has been applied to equations of
this type. Though such curves do not constitute parabolas
of the strict conic section type, a curve based upon such
an equation carried to the second power of x is termed a
second degree parabola, to the third power of x, a third
degree parabola, etc. No uniform and simple type is
secured from this series. It is treated in more detail at a 
later point. 
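
A value of such a series may be computed as in the following Python sketch. The coefficients are hypothetical, chosen only for illustration, and the evaluation by Horner's rule is a convenience of the sketch rather than a method taken from the text.

```python
def potential_series(x, coefficients):
    """Evaluate a + b*x + c*x**2 + ... by Horner's rule.

    coefficients is the list [a, b, c, ...] in ascending powers of x.
    """
    result = 0
    for coefficient in reversed(coefficients):
        result = result * x + coefficient
    return result


# Hypothetical second degree parabola: y = 1 + 2x + 3x^2
print(potential_series(2, [1, 2, 3]))  # 17
# With only two coefficients the series reduces to the straight line y = 2 + 3x
print(potential_series(4, [2, 3]))     # 14
```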

Periodic functions constitute another distinct type, a 
class represented notably by electrical and meteorological 
relations, though not confined to these fields. The charac- 
teristic feature of such relationships is that values of the 
dependent variable repeat themselves at constant intervals 
of the independent variable. The sine curve, the basic 
type of this class, is illustrated in the following example. 

Problem: To construct the graph of the function y = sin x.

         x                 y
 (angle in degrees)     (sin x)

         0°              .000
        30°              .500
        60°              .866
        90°             1.000
       120°              .866
       150°              .500
       180°              .000
       210°            - .500
       240°            - .866
       270°            -1.000
       300°            - .866
       330°            - .500
       360°              .000
       390°              .500
       etc.


The graph is shown in Fig. 8.

Fig. 8. Sine Curve: Graph of the Equation y = sin x
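
The sine table may be reproduced, and the periodic property checked, with the following Python sketch. The function name `sine_table` and its parameters are assumptions made for the example; the period of 360 degrees is that of the sine function.

```python
import math


def sine_table(start=0, stop=390, step=30):
    """Map each angle in degrees to sin(angle), rounded to three places."""
    # Adding 0.0 turns a rounded negative zero into plain zero.
    return {angle: round(math.sin(math.radians(angle)), 3) + 0.0
            for angle in range(start, stop + 1, step)}


table = sine_table()
print(table[30], table[90], table[270], table[390])  # 0.5 1.0 -1.0 0.5

# Values repeat at a constant interval of 360 degrees of the independent variable.
period = 360
repeats = all(table[a] == table[a + period] for a in table if a + period in table)
print(repeats)  # True
```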


The full importance in statistical work of securing a 
mathematical expression for the relation between two vari- 
ables cannot be demonstrated until the subject has been 
further developed. One fundamental object is the deter- 
mination of physical or economic laws underlying observed 
phenomena. Another more practical object is the securing 
of a formula by means of which values of one variable may 
be approximated from given values of the other. Examples 
throughout the book will serve to illustrate how these 
objects are attained. 


LOGARITHMS


Logarithms, which play such an important part in 
general mathematical operations, are of equal importance 
in the manipulation of the raw materials of statistics. 


The nature of logarithms, and the methods by which they
are employed to facilitate arithmetic processes, may be
briefly reviewed. This discussion is concerned only with
the common system of logarithms of which the base is 10.

1 A fuller discussion of different curve types is presented below, in the section
dealing with the analysis of time series.

Any positive number may be expressed as a power of 10. 


Thus
    1,000 = 10 × 10 × 10 = 10^3
    10,000 = 10 × 10 × 10 × 10 = 10^4

In each case the exponent of 10 (the small number written
above and to the right) indicates the number of times the
figure 10 is repeated as a factor. For the integral powers
of 10 the exponent is a whole number, but for the other
numbers the exponent will contain a fractional value.
Thus 100 is equal to 10 raised to the power 2, or 10^2; 110 is
equal to 10 raised to the power 2.04139, or 10^{2.04139}.

The exponent of 10, or the index of the power to which 
10 must be raised to equal a certain number, is called the 
logarithm of that number. The logarithm of 100 is 2, the 
logarithm of 110 is 2.04139, the logarithm of 998 is 2.99913. 
These figures all have reference to the base 10, though a 
system of logarithms might be developed on any base. In 


general, if
    a = b^c
then
    log_b a = c


which may be read “the logarithm of a to the base b is 
equal to c.” The relation between the given number, the
base and the logarithm, when the common system of
logarithms is employed, may be easily remembered if the
following relations are kept in mind:
    100 = 10^2
    log_{10} 100 = 2


The logarithm of any number has two parts, the integral 
and the decimal. The whole number is called the charac- 
teristic, and the decimal portion is termed the mantissa. 
The former is determined in a given case by inspection,
while the mantissa may be obtained from logarithmic
tables. The characteristic varies with the location of the 
decimal point, while the mantissa remains the same for 
any given combination of numbers. This fact is illustrated 
by the following figures: 


log of 8450   = 3.92686
log of 845    = 2.92686
log of 84.5   = 1.92686
log of 8.45   =  .92686
log of .845   = 9.92686 - 10
log of .0845  = 8.92686 - 10
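
The division of a logarithm into characteristic and mantissa may be sketched in Python as follows. The function name is an assumption, and `math.log10` stands in for the printed table of logarithms. A characteristic of -1 corresponds to the form written 9.92686 - 10 above, and -2 to 8.92686 - 10.

```python
import math


def characteristic_and_mantissa(number):
    """Split the common logarithm of a positive number into its whole and decimal parts."""
    logarithm = math.log10(number)
    characteristic = math.floor(logarithm)   # fixed by the position of the decimal point
    mantissa = logarithm - characteristic    # always between 0 and 1
    return characteristic, round(mantissa, 5)


for number in (8450, 845, 84.5, 8.45, 0.845, 0.0845):
    print(number, characteristic_and_mantissa(number))
# The mantissa is 0.92686 in every case; the characteristic runs 3, 2, 1, 0, -1, -2.
```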


In finding the natural number to which a given logarithm 
corresponds (such natural numbers are termed anti-loga-
rithms), the mantissa determines the sequence of figures,
while the whole number, or characteristic, determines the 
location of the decimal point. For example, in seeking the 
anti-logarithm of 2.17609 it is found that the decimal 
.17609 follows the natural number 1500, in a table of 
logarithms. Since the characteristic is 2, the natural 
number desired must lie between 100 and 1000, and must 
therefore be 150. 

A brief study of the following figures, showing the pro- 
gression of numbers corresponding to certain powers of 10, 
will help to fix in mind the relations between the multiples 
of 10 and their logarithms, and will enable the characteristic 
of a desired logarithm to be readily determined. 


.0001     .001      .01       .1        1       10      100     1,000   10,000
10^{-4}   10^{-3}   10^{-2}   10^{-1}   10^0    10^1    10^2    10^3    10^4
The exponents of 10 in the lower row are the logarithms 
of the numbers in the upper row. 

It should be noted that the logarithms of all numbers 
from 0 to 1 are negative. Thus the logarithm of .845 is 
-1 + .92686; this is written 9.92686 - 10. In cover-
ing the range of all positive natural numbers from zero to
infinity, logarithms traverse all positive and negative
values. A negative natural number, therefore, can have
neither a positive nor a negative logarithm. 

The advantage of thus expressing numbers as powers of 
10 lies in the fact that the ordinary arithmetic operations 
of multiplication, division, raising to powers and extracting 
roots are greatly facilitated by this procedure. 

To multiply numbers, add their logarithms. The sum 
of the logarithms of the factors is the logarithm of their 
product. In general terms: 


    a^b × a^c = a^{b+c}


Specifically
    10^2 × 10^3 = (10 × 10) × (10 × 10 × 10) = 10^5 = 100,000
    100 × 1,000 = 100,000


To divide one number by another, subtract the loga- 
rithm of the latter from the logarithm of the former. The 
remainder is the logarithm of the desired quotient. 


In general terms:
    a^b ÷ a^c = a^{b-c}


Specifically
    10^5 ÷ 10^2 = (10 × 10 × 10 × 10 × 10) ÷ (10 × 10) = 10^3 = 1,000
    100,000 ÷ 100 = 1,000


To raise a given number to any power, multiply the 
logarithm of the number by the index of the power. The 
product is the logarithm of the desired power. 


In general terms:
    (a^b)^c = a^{bc}


Specifically
    (10^3)^2 = (10 × 10 × 10) × (10 × 10 × 10) = 10^6 = 1,000,000
    1,000^2 = 1,000,000


To extract any root of a given number, divide the loga- 
rithm of the number by the index of the root. The quo- 
tient is the logarithm of the desired root. 

In general terms:
    \sqrt[c]{a^b} = a^{b/c}

Specifically
    \sqrt[3]{10^6} = 10^{6/3} = 10^2 = 100
    \sqrt[3]{1,000,000} = 100


In summary: 
    log (a × b) = log a + log b
    log (a ÷ b) = log a - log b
    log a^b = b × log a
    log \sqrt[b]{a} = log a ÷ b


These characteristic advantages of logarithms have been 
made use of in the construction of the slide rule, an instru- 
ment for reducing routine toil which should be familiar to 
all students of statistics. 
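
The four rules summarized above may be tried out in Python, which here takes the place of the table of logarithms and the slide rule. The function names are assumptions made for the sketch; results are rounded to the nearest whole number because logarithms computed by machine, like those taken from a table, are approximate.

```python
import math


def multiply(a, b):
    """Multiply by adding logarithms."""
    return 10 ** (math.log10(a) + math.log10(b))


def divide(a, b):
    """Divide by subtracting logarithms."""
    return 10 ** (math.log10(a) - math.log10(b))


def power(a, index):
    """Raise to a power by multiplying the logarithm by the index."""
    return 10 ** (index * math.log10(a))


def root(a, index):
    """Extract a root by dividing the logarithm by the index."""
    return 10 ** (math.log10(a) / index)


print(round(multiply(100, 1000)))    # 100000
print(round(divide(100000, 100)))    # 1000
print(round(power(1000, 2)))         # 1000000
print(round(root(1000000, 3)))       # 100
```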


LOGARITHMIC EQUATIONS 


The graphic representation of data by means of a system 
of rectangular coordinates has been described above and
some of the advantages of this method have been outlined. 
For many purposes it is desirable to plot logarithms rather 
than the natural numbers themselves. This may result in 
bringing out significant relations more distinctly, or it may 
serve greatly to simplify and facilitate the manipulation of 
data. In particular, when it is possible through the use of 
logarithms to reduce a complex curve to the straight line 
form, a distinct gain has been made in the direction of 
simplicity of treatment and interpretation. 

A linear equation, it will be recalled, is of the general 
form y = a + bx, where a and b are constants which
measure, respectively, the y-intercept of the given line and 
the slope. The simplification of equations through the use 
of logarithms involves in all cases the substitution of 
log x or log y, or both, for the x or y variables, thereby
reducing an equation of a higher order to a simpler form. 

This process may be illustrated with reference to the 
equation y = x^2. When plotted on rectangular coordinates
this equation gives a curve of the parabolic type (cf. Fig. 5).
Reduced to logarithmic form this becomes log y = 2 log x.
This equation, in which the variables are log y and log x,
is linear in form. It is plotted in Fig. 9, for positive values
of log x. To indicate the relations involved, natural numbers
corresponding to the logarithms are given on scales to the
right and at the top of the figure. The natural numbers
appearing on the scales constitute geometric series, while
their logarithms form arithmetic series. Equal vertical
distances on the chart, it will be noted, represent equal
absolute increments on the scale of logarithms and equal
percentage increments on the scale of natural numbers.

Fig. 9. Graph of the Equation log y = 2 log x. (Logarithmic form
of the equation y = x^2)
[Scales of logarithms at the bottom and left; natural numbers at the top and right]

The equation y = 5x^3 can be reduced in the same way
to log y = log 5 + 3 log x, a linear form. Similarly, all equa-
tions of the type y = ax^b, that is to say, all simple parabolas
and hyperbolas, can be reduced to the straight line form
log y = log a + b log x. Graphically this means plotting the
logarithms of the y’s against the logarithms of the x’s.

A different problem is presented by an equation of the 
type y = ab^x, the graph of which is termed an exponential
curve. Expressed in logarithmic form, we have log y = log
a + x log b. This is also of the linear type, the two con-
stants being log a and log b, while the variables are x and
log y. If we plot the natural x’s and the logs of the y’s,
with this type of equation, a straight line will be secured. 
A curve of this type is discussed and illustrated below. 
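
Both reductions may be checked numerically, as in the following Python sketch. The helper `slopes` and the constants of the exponential example (a = 3, b = 2) are assumptions chosen for illustration; the power curve is the y = 5x^3 of the text. Equal slopes between successive points indicate that the points lie on a straight line.

```python
import math


def slopes(points):
    """Slopes between successive points; equal slopes mean a straight line."""
    return [(y1 - y0) / (x1 - x0)
            for (x0, y0), (x1, y1) in zip(points, points[1:])]


xs = [1, 2, 4, 8]

# Power type, y = 5x^3: plot log y against log x.
power_points = [(math.log10(x), math.log10(5 * x ** 3)) for x in xs]

# Exponential type, y = 3 * 2^x (hypothetical constants): plot log y against x.
exponential_points = [(x, math.log10(3 * 2 ** x)) for x in xs]

print([round(s, 6) for s in slopes(power_points)])        # [3.0, 3.0, 3.0]
print([round(s, 6) for s in slopes(exponential_points)])  # [0.30103, 0.30103, 0.30103]
```

In the first case the common slope is the exponent b = 3; in the second it is log b = log 2.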


LOGARITHMIC AND SEMI-LOGARITHMIC CHARTS


There are certain disadvantages to the plotting of loga- 
rithms, however. If a considerable number of points are 
being plotted the task of looking up the logarithms may be 
tedious, and, in addition, the original values, in which 
chief interest lies, will not appear on the chart. These 
difficulties may be avoided by constructing charts with the 
scales laid off logarithmically, but with the natural numbers 
instead of the logarithms appearing on the scales. This is 
an arrangement identical with that employed in the con- 
struction of slide rules. Thus, although the natural numbers 
are given on the scales, distances are proportional to the 
logarithms of the numbers thereon plotted. In Fig. 10 
such a chart is presented, showing the graph of the equa-
tion y = x^2.

Fig. 10. Graph of the Equation y = x^2. (Plotted on paper with
logarithmic scales)

A variation of this type of chart which is of great im-
portance in statistical work is one which is scaled arith-
metically on the horizontal axis and logarithmically on the
vertical axis. This is equivalent, of course, to plotting the
x’s on the natural scale and plotting the logarithms of the
y’s. As was pointed out above, such a combination of
scales reduces a curve of the exponential type to a straight
line. Plotting paper of this semi-logarithmic or “ratio”
type may be constructed with the aid of a slide rule or
of logarithms, or may be purchased ready made. It is of
particular value in charting economic statistics, because of
the fact that time is usually one of the variables in such
cases, and it is desirable to plot this variable on the natural
scale.

As an example of this type of curve the compound inter-
est law may be used. If r be taken to represent the rate
of interest, x the number of years, p the principal, and y
the sum to which the principal amounts at the end of x
years (interest being compounded annually), an equation is
secured of the form

    y = p(1 + r)^x

Expressed logarithmically this becomes

    log y = log p + x log (1 + r)

the equation to a straight line.

In Fig. 11 a curve representing the growth of ten dollars
at compound interest at 6 per cent is plotted on the natural
scale. This is the graph of the exponential equation

    y = 10 (1 + .06)^x

y representing the total amount of principal and interest
at the end of x years. Figure 12 shows the same data
plotted on semi-logarithmic paper, the exponential curve
being reduced to a straight line.

Fig. 11. The Compound Interest Law: The Growth of $10.00 at Compound
Interest at 6% for 100 Years. (Plotted on the arithmetic scale)
[Vertical scale: dollars]

Fig. 12. The Compound Interest Law: The Growth of $10.00 at Compound
Interest at 6% for 100 Years. (Plotted on the semi-logarithmic or ratio scale)
[Vertical scale: dollars; horizontal scale: years]
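
The figures behind these two charts may be computed as in the following Python sketch, which is offered as an illustration and not as the method by which the charts were drawn. The function name and the choice of 25-year intervals are assumptions. The amounts themselves grow by a constant multiplier, while their logarithms grow by a constant increment, which is why the semi-logarithmic plot is a straight line.

```python
import math


def compound_amount(principal, rate, years):
    """Sum to which the principal amounts, interest compounded annually: y = p(1 + r)^x."""
    return principal * (1 + rate) ** years


years = list(range(0, 101, 25))                      # 0, 25, 50, 75, 100
amounts = [compound_amount(10, 0.06, x) for x in years]
logs = [math.log10(y) for y in amounts]
log_steps = [later - earlier for earlier, later in zip(logs, logs[1:])]

print(round(amounts[-1]))                 # about 3393 dollars after 100 years
print([round(step, 5) for step in log_steps])
# Every step equals 25 * log(1.06): the logarithms form an arithmetic series.
```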

The use of semi-logarithmic paper is not confined to 
cases in which an exponential curve is straightened out, 
for the significance of many types of data is most effectively 
brought out when charts of this type are used. These 
advantages are more fully explained below. 


THE CONSTRUCTION OF CHARTS 


When the results of observations or statistical investi- 
gations have been secured in quantitative form, one of the 
first steps toward analysis and interpretation of the data 
is that of presenting these results graphically. Not only is 
such procedure of scientific value in paving the way for 
further investigation of relationships, but it serves an 
immediate practical purpose in visualizing the results. A 
visual stimulus opens up a far more direct path to our 
understanding and imagination than that afforded by the 
more recently developed processes of reasoning. The inter- 
pretation of a column of raw figures may be a difficult task; 
the same data in graphic form may tell a simple and easily 
understood story. For these reasons graphic methods of 
presentation have come to play a highly important part in 
the everyday activities of business, as well as in the labo- 
ratory and drafting room. 

It is beyond the scope of this book to present any detailed 
account of the multiplicity of graphs employed by engineers 
and statisticians today. Certain of the more important 
principles of graphic presentation may be briefly explained, 
however, and some of the chief types of graphs which are 
in daily use may be illustrated. Other examples appear in 
later chapters of this book. 


FACTORS GOVERNING THE SELECTION OF A CHART


The selection of the type of chart to be employed in a 
given case will depend upon two general considerations.
The first of these relates to the character of the material
to be plotted. While the data of a given problem may 
frequently be presented graphically in several different 
forms, there is generally one type of chart best adapted to 
that material. It may be true, also, that certain types 
would be quite inappropriate to the data in question. The 
selection of a type of chart to employ, therefore, must be 
made with the characteristics of the data clearly in mind. 

Perhaps more important is the purpose which the given 
chart is designed to serve. Each of the many types of 
charts in common use is appropriate to certain specific 
purposes. It will bring out certain characteristics of the 
data or will emphasize certain relationships. There is no 
chart which is sovereign for all purposes. Until the purpose 
is clearly defined the best chart form can not be selected. 
The following descriptions of a few standard types will 
facilitate the selection of an appropriate form. 


CHARTS ADAPTED TO THE PLOTTING OF TIME SERIES


In the graphic presentation of a time series, primary 
interest attaches to the chronological variations in the 
values of the data, to the general trend and to the fluctua- 
tions about the trend. If the purpose is to emphasize the 
absolute variations, the differences in absolute units between 
the values of the series at different times, a simple chart 
of the type illustrated in Fig. 13 will serve the purpose. 
This chart depicts annual wheat flour exports from the 
United States during the period 1901-1923. Both scales 
are arithmetic. Points representing the various annual 
values are shown and, to facilitate interpretation, these 
points are connected by a series of straight lines. The 
chart tells a simple story of year-to-year fluctuations, with 
a general downward trend to 1910, an upward trend from 
1911 to 1919, and a decline since 1919. With respect to 
general make-up, the following points should be noted: 




1. The title constitutes a clear description of the material plotted 
and indicates the period covered. 

2. The vertical scale begins at the zero line, enabling a true 
impression to be gained of the magnitude of the fluctuations. 

3. The zero line and the line joining the plotted points are ruled
more heavily than the coordinate lines.


[Fig. 13: annual wheat flour exports from the United States,
1901-1923; vertical scale in millions of barrels]


4. Figures for the scales are placed at the left and at the bottom 
of the chart. The vertical scale may be repeated at the 
right to facilitate reading. All figures are so placed that 
they may be read from the base as bottom or from the 
right hand edge of the chart as bottom. 

5. The y-values of the plotted points are given at the top of the 
chart. This practice is helpful, though not necessary, as 
the values may be presented in a separate table. 


ADVANTAGES OF THE RATIO CHART


If relative rather than absolute variations are of chief
concern, the chart employed should be of the semi-logarithmic
type, scaled logarithmically on the y-axis and arithmetically
on the x-axis. In such a chart equal percentage variations are
represented by equal vertical distances, as opposed to the
ordinary arithmetic type in which equal absolute variations
are represented by equal vertical distances. The argument for
the use of the semi-logarithmic or ratio chart for the
representation of time series is that, in general, the
significance of a given change depends upon the magnitude of
the base from which the change is measured. That is, an
increase of 100 on a base of 100 is as significant as an
increase of 10,000 on a base of 10,000. In each case there is
an increase of 100%. The absolute increase in the second case
is 100 times that in the first case, and the two changes would
show in this proportion on the arithmetic chart. They would
show as of equal importance on the semi-logarithmic chart.

Fig. 14. Production of Steel in the United States, 1896-1922
(Plotted on the semi-logarithmic scale; vertical scale in
millions of long tons)

Such a chart is presented in Fig. 14, which shows the 
course of steel production in the United States from 1896 
to 1922. The absolute magnitudes are plotted, but the 
vertical scale is so constructed as to represent variations 
from year to year in proportion to their relative magnitude. 
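
The equal-ratio property of the logarithmic scale may be
checked by a short computation. The following is only a sketch,
written in Python for illustration; the function name
`vertical_distance` is an assumption introduced here, and the
figures are those of the example given above (a doubling from
100 and a doubling from 10,000).

```python
import math


def vertical_distance(start, end):
    """Distance between two values on a logarithmic scale, in log10 units."""
    return math.log10(end) - math.log10(start)


# An increase of 100 on a base of 100, and of 10,000 on a base of 10,000.
small = vertical_distance(100, 200)
large = vertical_distance(10_000, 20_000)

# On the arithmetic scale the two changes differ by a factor of 100.
arithmetic_ratio = (20_000 - 10_000) / (200 - 100)

print(round(small, 4), round(large, 4), arithmetic_ratio)
```

The two logarithmic distances agree, while the arithmetic
distances stand in the proportion of 100 to 1.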


Fig. 15. Sales of the Acme Corporation, 1910-1923. Showing the Total Sales
in the United States and the Sales in Certain Subdivisions. (Plotted on
the arithmetic scale; vertical scale in dollars)


Certain distinctive advantages of the ratio or logarithmic 
ruling are brought out by a comparison of Fig. 15 and 
Fig. 16. The data presented graphically in these two 
charts are the following: 


Fig. 16. Sales of the Acme Corporation, 1910-1923. Showing the Total Sales
in the United States and the Sales in Certain Subdivisions. With Scales of
Increase, Decrease and Comparison. (Plotted on semi-logarithmic paper;
vertical scale in dollars)


TABLE 2
Sales of the Acme Corporation, 1910-1923


       |           |            |            |   Middle   |    Total
Year   |   Penn.   | New Jersey |  New York  |  Atlantic  |   United
       |           |            |            |   States   |   States
1910   | $305,000  |  $105,000  | $  465,000 | $  875,000 | $ 5,600,000
1911   |  310,000  |   100,000  |    480,000 |    890,000 |   5,400,000
1912   |  400,000  |   200,000  |    500,000 |  1,100,000 |   6,000,000
1913   |  425,000  |   250,000  |    575,000 |  1,250,000 |   5,200,000
1914   |  300,000  |   125,000  |    465,000 |    890,000 |   6,400,000
1915   |  400,000  |   130,000  |    600,000 |  1,130,000 |   8,200,000
1916   |  700,000  |   300,000  |    800,000 |  1,800,000 |   9,700,000
1917   |  760,000  |   350,000  |    740,000 |  1,850,000 |   8,300,000
1918   |  630,000  |   320,000  |    750,000 |  1,700,000 |   8,100,000
1919   |  650,000  |   400,000  |    775,000 |  1,825,000 |   9,200,000
1920   |  900,000  |   500,000  |  1,050,000 |  2,450,000 |  10,000,000
1921   |  500,000  |   250,000  |    750,000 |  1,500,000 |   6,500,000
1922   |  650,000  |   300,000  |    950,000 |  1,900,000 |   8,000,000
1923   |  750,000  |   425,000  |  1,025,000 |  2,200,000 |   9,500,000


If the five series are to be presented on a single chart, 
scaled arithmetically, a scale must be selected which will 
include the largest item recorded, $10,000,000. Such a scale 
reduces the relative importance of the smaller magnitudes. 
From Fig. 15 it appears that during the period covered by 
the chart very considerable fluctuations occurred in the 
sales for the country as a whole, that minor fluctuations 
occurred in the total for the Middle Atlantic States, and 
that the sales in the three individual states represented 
remained practically constant. Such a picture is quite
misleading. The true state of affairs is reflected in Fig. 16,
in which the same data are plotted on paper with a
semi-logarithmic ruling. The fluctuations in the individual
states now appear to have been relatively greater than those
in the total for the country as a whole. For the purpose of
comparing series which differ materially with respect to the
magnitude of the individual items, the arithmetic ruling is
quite useless, giving a thoroughly distorted picture of the
true relations. The ratio ruling permits a legitimate
comparison.
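
The contrast between the two rulings may be made concrete by
comparing, for each series of Table 2, the absolute range
(which governs its appearance on the arithmetic chart) with the
ratio of its largest to its smallest value (which governs its
appearance on the ratio chart). The sketch below is offered in
Python for illustration only; the helper names `absolute_range`
and `ratio_range` are assumptions, not terms used in the text.

```python
# Sales figures from Table 2 (dollars), 1910-1923.
sales = {
    "Penn.": [305000, 310000, 400000, 425000, 300000, 400000, 700000,
              760000, 630000, 650000, 900000, 500000, 650000, 750000],
    "New Jersey": [105000, 100000, 200000, 250000, 125000, 130000, 300000,
                   350000, 320000, 400000, 500000, 250000, 300000, 425000],
    "New York": [465000, 480000, 500000, 575000, 465000, 600000, 800000,
                 740000, 750000, 775000, 1050000, 750000, 950000, 1025000],
    "Total United States": [5600000, 5400000, 6000000, 5200000, 6400000,
                            8200000, 9700000, 8300000, 8100000, 9200000,
                            10000000, 6500000, 8000000, 9500000],
}


def absolute_range(values):
    """Spread in dollars: what the arithmetic ruling displays."""
    return max(values) - min(values)


def ratio_range(values):
    """Largest value divided by smallest: what the ratio ruling displays."""
    return max(values) / min(values)


for name, values in sales.items():
    print(f"{name:<20} {absolute_range(values):>10,} {ratio_range(values):6.2f}")
```

The national total has by far the largest absolute range, yet
its largest value is less than twice its smallest, whereas New
Jersey sales vary in the ratio of five to one.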

The scales printed below Fig. 16 emphasize certain very 
useful features of the logarithmic ruling. The scale of 
increase may be used to measure with a fair degree of accu- 
racy the increase in a given series between any two dates. 
A given vertical distance on the chart, it will be recalled, 
represents a constant percentage increase at all points on 
the chart. Thus the distance from $100,000 to $200,000, 
along the vertical scale, is the same as the distance from 
$2,000,000 to $4,000,000. Any vertical distance may be 
measured, and the percentage of increase which it repre- 
sents may be determined by laying off the given distance 
along the scale of increase, which is always read from the 
bottom up. For example, to determine the degree of in- 
crease in New Jersey sales from 1912 to 1913, we measure 
the vertical distance between the points plotted for these 
two years. Laying off this distance along the scale, it is 
found to represent a 25 per cent increase. 

The scale of decrease is used in a similar fashion. The 
vertical distance between any two points is measured, and 
the percentage decrease which it represents is determined 
by laying off the given distance on the scale from the top 
downward. The arrows indicate the direction in which the 
various scales are to be read. 

By means of the scale of comparison the percentage rela- 
tion of one series to another at any time may be deter- 
mined. For example, we may wish to know the percentage 
relation between Middle Atlantic sales and total sales in 
the United States in 1912. The vertical distance between 
the two plotted points is measured, and laid off on the scale 
of comparison, reading from the top downward. It is 
found to be approximately 18 per cent. 
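
The three scales amount to converting a measured vertical
distance on the logarithmic ruling back into a percentage. A
minimal sketch in Python follows, assuming distances are
expressed in log10 units; the function names are hypothetical
and the figures are taken from Table 2.

```python
import math


def log_distance(lower, upper):
    """Vertical distance between two plotted values, in log10 units."""
    return math.log10(upper) - math.log10(lower)


def percent_increase(distance):
    """Scale of increase: read from the bottom up."""
    return (10 ** distance - 1) * 100


def percent_decrease(distance):
    """Scale of decrease: read from the top downward."""
    return (1 - 10 ** (-distance)) * 100


def percent_relation(distance):
    """Scale of comparison: the lower value as a percentage of the upper."""
    return 10 ** (-distance) * 100


# New Jersey sales, 1912 to 1913: $200,000 to $250,000.
increase = percent_increase(log_distance(200_000, 250_000))

# New Jersey sales, 1913 to 1914: $250,000 to $125,000.
decrease = percent_decrease(log_distance(125_000, 250_000))

# Middle Atlantic sales relative to total United States sales in 1912.
relation = percent_relation(log_distance(1_100_000, 6_000_000))

print(round(increase, 1), round(decrease, 1), round(relation, 1))
```

The first and third results correspond to the 25 per cent and
the approximately 18 per cent read from the chart in the
examples above.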

Scales of the type illustrated above may be readily
constructed on a given chart by using the ratio ruling for the
scale intervals. When a series of charts is constructed on
semi-logarithmic paper of a standard type it is more
convenient to construct such scales in a more permanent form,
in the shape of special rulers.

The chief advantages of the semi-logarithmic ruling in 
chart construction may be briefly summarized: 


1. A curve of the exponential type becomes a straight line when 
plotted on a semi-logarithmic chart. For example, a curve 
representing the growth of any sum of money at compound 
interest takes the form of a straight line when so plotted. 

2. In any series, so long as the rate of increase or decrease remains 
constant the graph will be a straight line on this ruling. 

3. Equal relative changes are represented by lines having equal
slopes. Thus two series increasing or decreasing at equal
rates will be represented by parallel lines.

4. Comparison of the rates of change in two or more series is 
effected by comparison of the slopes of the plotted lines. 

5. The semi-logarithmic ruling permits, at the same time, the 
plotting of absolute magnitudes and the comparison of 
relative changes. 

6. Comparison of series differing materially in the magnitude of 
individual items is possible with the semi-logarithmic chart. 

7. Percentages of change may be read and percentage relations 
between magnitudes determined directly from the chart. 
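
The first two points in the summary above can be verified
numerically: a sum growing at compound interest rises by equal
logarithmic steps, and so plots as a straight line on the ratio
ruling. The sketch below is in Python; the principal of 1,000
and the rate of 5 per cent are hypothetical figures chosen only
for illustration.

```python
import math

principal = 1000.0   # hypothetical sum of money
rate = 0.05          # hypothetical annual rate of interest

# Value of the sum at the end of each of ten years.
values = [principal * (1 + rate) ** year for year in range(11)]

# Year-to-year steps on the arithmetic scale grow steadily larger ...
arithmetic_steps = [b - a for a, b in zip(values, values[1:])]

# ... while the steps on the logarithmic scale are all equal,
# which is the condition for a straight line.
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(values, values[1:])]

print([round(s, 2) for s in arithmetic_steps])
print([round(s, 5) for s in log_steps])
```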


CHARTS FOR THE COMPARISON OF FREQUENCIES 


A different type of chart is called for when the object is 
the comparison of frequencies, that is, numbers of events 
or things of different classes. The following census figures 
may serve to illustrate the problem.


TABLE 3

Farms in New England States in 1920

State                 Number of farms
Maine                     48,227
New Hampshire             20,523
Vermont                   29,075
Massachusetts             32,001
Rhode Island               4,083
Connecticut               22,655




A graphic comparison of these six states with respect to 
number of farms in 1920 is afforded by the bar diagram in 
Fig. 17. This is a simple but effective type of chart for this 
purpose. 
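
A bar diagram of this kind can be roughed out even in plain
text. The following Python sketch draws one bar per state from
the census figures given above, every bar starting from zero;
the function name `bar_diagram` and the choice of one mark per
2,000 farms are assumptions made for illustration.

```python
# Number of farms in 1920, from the census figures above.
farms = {
    "Maine": 48227,
    "New Hampshire": 20523,
    "Vermont": 29075,
    "Massachusetts": 32001,
    "Rhode Island": 4083,
    "Connecticut": 22655,
}


def bar_diagram(data, units_per_mark=2000):
    """Return one text line per class, each bar drawn from the zero line."""
    width = max(len(name) for name in data)
    lines = []
    for name, count in data.items():
        bar = "#" * round(count / units_per_mark)
        lines.append(f"{name:<{width}} {bar} {count:,}")
    return lines


for line in bar_diagram(farms):
    print(line)
```

Because the bars begin at zero, their lengths stand in the same
proportion as the frequencies they represent.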

Further examples of this type of chart, as employed in
the representation of frequency distributions, are contained
in the next chapter. It is there shown how a frequency
polygon or frequency curve may grow out of the simple
bar diagram, when data of certain kinds are being handled.
Such frequency curves constitute very important graphic
types, but it will be more appropriate to treat them in
full at a later point.

Fig. 17. Farms in New England States in 1920


CHARTS FOR THE REPRESENTATION OF COMPONENT PARTS


It is frequently desirable in graphic presentation to break 
up a total into its component parts, in order that changes 
in the parts as well as in the total may be followed. The 
following table constitutes a simple example of such data. 


TABLE 4
Production Costs of the XYZ Corporation, 1923
(The figures show the total and subdivided cost per unit produced)


Month       | Material Cost | Labor Cost | Overhead Cost | Total Cost
January     |    $32.00     |   $12.00   |     $6.00     |   $50.00
February    |     31.00     |    11.00   |      6.00     |    48.00
March       |     33.00     |    12.00   |      6.50     |    51.50
April       |     31.00     |    13.00   |      6.00     |    50.00
May         |     27.00     |    13.00   |      7.00     |    47.00
June        |     24.50     |    13.50   |      7.00     |    45.00
July        |     23.00     |    16.00   |      7.00     |    46.00
August      |     22.00     |    18.50   |      8.00     |    48.50
September   |     23.50     |    18.00   |      7.50     |    49.00
October     |     24.00     |    16.00   |      7.50     |    47.50
November    |     24.00     |    17.00   |      8.00     |    49.00
December    |     24.50     |    17.50   |      8.50     |    50.50


These figures are presented graphically in Fig. 18. It is 
clear from this diagram that, while total costs have been 
relatively stable, certain of the elements in this total cost 
have fluctuated quite widely. A knowledge of the total 
alone is inadequate, and must be supplemented by knowl- 
edge of the changes in the component items. Such a chart 
as this presents a clear picture, not only of changes in the 
total, but of the movements of each of the constituent 
parts. 
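
The point may be checked directly from the cost figures. The
Python sketch below recomputes the monthly totals from the
three components and measures, for each series, the spread
between its highest and lowest month as a percentage of the
lowest; the name `relative_spread` is a hypothetical helper
introduced for this illustration.

```python
# Cost per unit produced, January to December 1923 (dollars), from Table 4.
costs = {
    "Material": [32.00, 31.00, 33.00, 31.00, 27.00, 24.50,
                 23.00, 22.00, 23.50, 24.00, 24.00, 24.50],
    "Labor": [12.00, 11.00, 12.00, 13.00, 13.00, 13.50,
              16.00, 18.50, 18.00, 16.00, 17.00, 17.50],
    "Overhead": [6.00, 6.00, 6.50, 6.00, 7.00, 7.00,
                 7.00, 8.00, 7.50, 7.50, 8.00, 8.50],
}

# The total for each month is the sum of its component parts.
totals = [sum(parts) for parts in zip(*costs.values())]


def relative_spread(values):
    """Highest minus lowest value, as a percentage of the lowest."""
    return (max(values) - min(values)) / min(values) * 100


for name, values in costs.items():
    print(f"{name:<9} {relative_spread(values):5.1f} per cent")
print(f"{'Total':<9} {relative_spread(totals):5.1f} per cent")
```

Each component varies over a far wider relative range than the
total, since the decline in material cost is largely offset by
the rise in labor and overhead.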

CUMULATIVE CHARTS 


In many cases chief interest in the development of a 
series attaches not to the value of each successive item but 
to the cumulated total of a number of such items. This 
may be so when a yearly production program has been 
laid out. In such a case it is the relation between cumu- 
lated production to date and scheduled production to date 
which is of major interest, and a chart form is needed 
which will enable this comparison to be made. The fol- 
lowing figures illustrate the type of data for which such 
charts are appropriate. 


Fig. 18. Elements of Production Costs of the XYZ Corporation
By Months, 1923


TABLE 5
Cumulative Production Schedule and Cumulative Output, 1924

Speedwell Automobile Company


            | Production | Cumulative Production |         | Cumulative
Month       |  Schedule  |       Schedule        | Output  |   Output
            |   (cars)   |        (cars)         | (cars)  |   (cars)
January     |   2,000    |         2,000         |    750  |      750
February    |   2,000    |         4,000         |  1,250  |    2,000
March       |   2,000    |         6,000         |  1,250  |    3,250
April       |   3,000    |         9,000         |  2,000  |    5,250
May         |   3,000    |        12,000         |  1,500  |    6,750
June        |   3,750    |        15,750         |  3,750  |   10,500
July        |   3,750    |        19,500         |  4,250  |   14,750
August      |   5,000    |        24,500         |  6,000  |   20,750
September   |   4,000    |        28,500         |  5,750  |   26,500
October     |   4,000    |        32,500         |         |
November    |   4,000    |        36,500         |         |
December    |   4,000    |        40,500         |         |




It is assumed that this table represents the situation as 
of the end of September. 

In Fig. 19 the two cumulative curves are plotted. The
relation between actual and scheduled production at the end
of each month is shown on the chart, and it is possible from
the scale to read the approximate amount by which production
is behind schedule. By reference to the figures, which should
always accompany the chart, the exact relation may be
determined. Such a chart has many applications, some of which
are illustrated in the following chapter.

Fig. 19. Comparison of Scheduled and Actual Output (Cumulative)
Speedwell Automobile Company, 1924
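
The cumulative columns, and the amount by which production
stands behind schedule at the end of each month, follow from
the monthly figures by simple running addition. A minimal
Python sketch, using the Speedwell figures of Table 5 (the
variable names are illustrative only):

```python
from itertools import accumulate

# Monthly figures from Table 5 (cars). Output is known through September.
schedule = [2000, 2000, 2000, 3000, 3000, 3750,
            3750, 5000, 4000, 4000, 4000, 4000]
output = [750, 1250, 1250, 2000, 1500, 3750, 4250, 6000, 5750]

# Running totals: the two curves plotted on the cumulative chart.
cumulative_schedule = list(accumulate(schedule))
cumulative_output = list(accumulate(output))

# Cars behind schedule at the end of each month for which output is known.
behind = [s - o for s, o in zip(cumulative_schedule, cumulative_output)]

print(cumulative_schedule)
print(cumulative_output)
print(behind)
```

The last entry of `behind` is the exact deficiency at the end
of September, which the chart permits to be read only
approximately.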


THE GANTT PROGRESS CHART


The same data may be presented in a very effective form
by making use of a type of chart developed by Mr. H. L.
Gantt. An adequate description of this chart and of its
many uses would far exceed the space which can be given
to it here, but its characteristics may be indicated in a
very brief account.

Once a schedule has been drawn up, the Gantt chart 
may be utilized in checking actual accomplishment against 
the schedule. Having such a schedule as that given in 
Table 5, the monthly and annual quotas may be entered 
on a form similar to that shown in Fig. 20. The entry to 
the left of each monthly space indicates the amount sched- 
uled for production during that month. The entry to the 
right of each monthly space indicates the cumulated sched- 
uled production to the end of the given month. In this 
figure the results of the first two months’ operations are 
shown. The heavy black line indicates the cumulated 
actual production during this period, amounting to 2000 
cars. The narrow upper lines in the January and February 
columns measure the actual production in each of those 
months. If actual production in either month had equaled 
the scheduled production the light line would extend across 
the full monthly space. When actual production in a given 
month exceeds the scheduled production a double light line 
appears. 

It should be noted that the spaces into which each 
monthly period is divided represent equal time intervals 
but varying amounts in terms of actual production. Thus 
the space representing one fifth of the January interval 
represents a production of 400 cars (the January quota 
being 2000). The space representing one fifth of the 
August interval represents 1000 cars (the August quota 
being 5000). In reading the chart in terms of absolute 
magnitudes reference must be had to the monthly quotas. 

The situation at the end of September is shown in Fig. 21.
The arrow at the top of the diagram indicates the point of
time actually reached. That actual production is behind
scheduled production by one-half month is apparent from the
relation between this arrow and the heavy black line, while
the light lines indicating monthly production show that actual
output has exceeded the monthly quota in each of the last
three months.

[Fig. 20 and Fig. 21: Gantt progress charts of Speedwell
Automobile production, comparing scheduled and actual output.
Fig. 20 shows the results of the first two months' operations;
Fig. 21 shows the situation at the end of September.]
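
The length of the heavy line, measured in monthly spaces, can
be computed by laying the cumulated output against the
successive monthly quotas. The Python sketch below is one way
of doing so under that reading of the chart; the function name
`months_accomplished` is hypothetical, and the figures are
those of the Speedwell schedule in Table 5.

```python
# Monthly quotas and actual output (cars), from Table 5.
schedule = [2000, 2000, 2000, 3000, 3000, 3750,
            3750, 5000, 4000, 4000, 4000, 4000]
output = [750, 1250, 1250, 2000, 1500, 3750, 4250, 6000, 5750]


def months_accomplished(quotas, total_output):
    """Length of the heavy line, in monthly spaces of the schedule."""
    remaining = total_output
    months = 0.0
    for quota in quotas:
        if remaining >= quota:
            # The whole of this month's quota has been covered.
            months += 1
            remaining -= quota
        else:
            # Only part of this month's space is filled.
            months += remaining / quota
            break
    return months


elapsed = len(output)                              # nine months reached
heavy_line = months_accomplished(schedule, sum(output))
months_behind = elapsed - heavy_line

# Light lines: each month's output as a fraction of that month's quota.
light_lines = [made / quota for made, quota in zip(output, schedule)]

print(heavy_line, months_behind, [round(x, 2) for x in light_lines[-3:]])
```

At the end of September the heavy line reaches half way across
the September space, so that production is one-half month
behind, while each of the last three light-line ratios exceeds
unity.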

The Gantt chart has a great variety of applications in 
governmental and business organizations. The economy 
of space is such that developments in a number of depart- 
ments or districts may be shown on a single chart. It 
constitutes the simplest and most effective graphic method 
known for following the progress of work under way, for 
comparing actual accomplishment with an established pro- 
gram. And in so doing, it increases by so much the effi- 
ciency of administrative control. 


STANDARD RULES FOR GRAPHIC PRESENTATION


Graphic methods have been widely employed in the 
physical and social sciences and in business, and the result- 
ing diversity of uses has made it difficult to secure stand- 
ardization of practice. To remedy this defect, a joint 
committee composed of representatives of the various 
groups interested in this subject prepared a report recom- 
mending the employment of certain standard methods in 
graphic work. This report, which is a model of clear and 
succinct presentation, has appeared in various publications, 
but it cannot be too widely circulated. The suggestions of 
the committee are presented below:¹


¹ This report, which is the work of the Joint Committee on Standards for
Graphic Presentation, of which Willard C. Brinton was chairman, appeared in
the Quarterly Publications of the American Statistical Association, Vol. 14, 790-
797, 1915. It is presented here with the permission of Mr. Brinton. Copies may
be secured from the American Society of Mechanical Engineers, New York.


1. The general arrangement of a diagram should proceed from left
to right.
[Illustration 1]

2. Where possible represent quantities by linear magnitudes,
as areas or volumes are more likely to be misinterpreted.
[Illustration 2: Year and Tons; 1900, 270,568; 1914, 555,031]

3. For a curve the vertical scale, whenever practicable, should
be so selected that the zero line will appear on the diagram.
[Illustration 3]

4. If the zero line of the vertical scale will not normally
appear on the curve diagram, the zero line should be shown by
the use of a horizontal break in the diagram.
[Illustration 4]

5. The zero lines of the scales for a curve should be sharply
distinguished from the other coordinate lines.
[Illustrations 5a, 5b, 5c]

6. For curves having a scale representing percentages, it is
usually desirable to emphasize in some distinctive way the
100 per cent line or other line used as a basis of comparison.
[Illustrations 6a, 6b, 6c]

7. When the scale of a diagram refers to dates, and the period
represented is not a complete unit, it is better not to
emphasize the first and last ordinates, since such a diagram
does not represent the beginning or end of time.
[Illustration 7]

8. When curves are drawn on logarithmic coordinates, the
limiting lines of the diagram should each be at some power of
ten on the logarithmic scales.
[Illustration 8]

9. It is advisable not to show any more coordinate lines than
necessary to guide the eye in reading the diagram.
[Illustrations 9a, 9b]

10. The curve lines of a diagram should be sharply
distinguished from the ruling.
[Illustration 10]

11. In curves representing a series of observations, it is
advisable, whenever possible, to indicate clearly on the
diagram all the points representing the separate observations.
[Illustrations 11a, 11b, 11c]

12. The horizontal scale for curves should usually read from
left to right and the vertical scale from bottom to top.
[Illustration 12]

13. Figures for the scales of a diagram should be placed at the
left and at the bottom or along the respective axes.
[Illustrations 13b, 13c]

14. It is often desirable to include in the diagram the
numerical data or formulae represented.
[Illustrations 14a, 14b, 14c]

15. If numerical data are not included in the diagram it is
desirable to give the data in tabular form accompanying the
diagram.
[Illustration 15: Population, with the tabulated values
23,191,876; 31,443,321; 38,558,371; 50,155,783]

16. All lettering and all figures on a diagram should be placed
so as to be easily read from the base as the bottom, or from
the right-hand edge of the diagram as the bottom.
[Illustration 16]

17. The title of a diagram should be made as clear and complete
as possible. Subtitles or descriptions should be added if
necessary to insure clearness.
[Illustration 17: "Aluminum Castings Output of Plant No. 2, by
Months, 1914. Output is given in short tons. Sales of Scrap
Aluminum are not included."]


REFERENCES
(1) On Mathematical Concepts


A discussion of the elementary mathematical concepts
involved in graphic presentation and the employment of
logarithms will be found in any of the standard textbooks on
algebra and coordinate geometry. The treatment of these
subjects in the books listed below will be found helpful to
the student of statistics.


Griffin, F. L. Introduction to Mathematical Analysis.

Karsten, Karl G. Charts and Graphs.

Lipka, Joseph. Graphical and Mechanical Computation.

Mellor, J. W. Higher Mathematics for Students of Chemistry
and Physics.

Schultze, Arthur. Graphic Algebra.

Steinmetz, C. P. Engineering Mathematics.

Whitehead, A. N. An Introduction to Mathematics.


(2) On Graphic Methods


Brinton, W. C. Graphic Methods for Presenting Facts.

Clark, Wallace. The Gantt Chart.

Field, J. H. Some Advantages of the Logarithmic Scale in
Statistical Diagrams. Journal of Political Economy, October,
1917.

Fisher, Irving. The "Ratio" Chart. Quarterly Publications of
the American Statistical Association, June, 1917.

Haskell, A. C. Graphic Charts in Business.

Haskell, A. C. How to Make and Use Graphic Charts.

Karsten, Karl. Charts and Graphs.


(The publishers and the dates of publication of the volumes named above are
given in the bibliography at the end of this volume.)


CHAPTER III 


THE ORGANIZATION OF STATISTICAL DATA: 
THE FREQUENCY DISTRIBUTION 


The task of the statistician engaged in business or eco- 
nomic research includes the organization, analysis and 
interpretation of quantitative data relating to business 
affairs and to economic conditions. To these fundamental 
operations that of collecting the original data may be 
added, though more frequently data will be compiled 
directly from primary or secondary sources. 

At the outset it is necessary to distinguish between the 
problems arising in the analysis of time series and those 
involved in the organization and analysis of materials in 
connection with which the time factor does not enter. In 
studying a time series the primary object is to measure and 
analyze the chronological variations in the value of the 
variable. Thus one may study variations in sales over a 
period of years, fluctuations in the production of bituminous 
coal, or changes in the general level of prices. Quite differ- 
ent is the procedure in the study of such a problem as 
income distribution at a given time. In this case we are 
desirous of knowing how many people in the United States 
fall in each of a number of income classes. The general 
problem of organization in this latter class of cases is to 
determine how many times each value of a variable is re- 
peated and how these values are distributed. Data of this 
sort, when organized, constitute a frequency series, as 
opposed to the time or historical series. The methods
appropriate to these two types of analysis differ
fundamentally and will therefore be treated separately. In the
present section we are concerned with the organization and
preliminary analysis of data in connection with which the
time element, while it may be present, does not enter as
a factor.


UNORGANIZED DATA 


When quantitative data of the type with which the 
statistician works are presented in a raw state they appear 
as unorganized masses of material, without form or struc- 
ture. They may have been drawn from the production or 
sales records of a business establishment, or they may
represent a miscellaneous collection of price quotations. If
the data have been gathered by other agencies they may
already have been arranged in the form of a general table, 
but this form may be entirely unsuited to the particular 
object in the mind of the investigator. The first task of 
the statistician is the organization of the figures in such 
a form that their significance, for the purpose in hand, may 
be appreciated, that comparison with masses of similar 
data may be facilitated, and that further analysis may be 
possible. Scientific method, it has been noted, involves
observation, inference and verification. Data, the results of 
observation, must be put into definite form, must be given 
coherent structure, before the process of inference is 
possible. 

The following figures, representing the earnings during a 
given week of 210 individuals engaged in piece work in a 
certain manufacturing establishment, will serve as an 
example of such data in their raw state: 


WEEKLY EARNINGS OF 210 EMPLOYEES


$26.25 $28.70 $24.15 $29.75 $29.20 $30.60 $23.40 $24.75


THE ARRAY


If these figures are arranged in order of magnitude something
will have been done toward securing a coherent
structure. The range covered and the general distribution 
throughout this range will then be clear, and the way will 
be prepared for further organization. When so arranged 
the following array is secured: 


ARRAY: WEEKLY EARNINGS OF 210 EMPLOYEES


$22.55 $25.15 $26.15 $26.75 $27.45 $27.95 $28.60
 23.00  25.15  26.15  26.75  27.45  27.95  28.65
 23.00  25.15  26.25  26.80  27.50  28.00  28.70


FREQUENCY TABLES 


While this array presents the figures in a shape much 
more suitable for study than the haphazard distribution 
first shown, there is still something to be desired before the 
mind can readily grasp the full significance of the data. 
The factory manager may see that the smallest amount 
earned during the week was $22.55, that the largest amount 
earned was $32.00, and that most of the employees earned 
between $25.00 and $29.00, but this is still a vague
description of the data. By a process of grouping, that is, by
putting into common classes all individuals whose earnings
fall within certain limits, a simplified and more compact
presentation of the wage distribution may be obtained.
The following table shows the results of this grouping
process when the range of each class (the class-interval) is
two dollars.


TABLE 6

Frequency Distribution of Employees
Classified on the Basis of Weekly Earnings (Class-interval = $2)


Weekly earnings        Number earning stated amount
                       (frequency)

$22.00 to $23.99            8
 24.00 to  25.99           48
 26.00 to  27.99           96
 28.00 to  29.99           47
 30.00 to  31.99           10
 32.00 to  33.99            1
                          210


This table presents a condensed summary of the original 
figures, a summary which not only gives us the approximate 
range of the earnings, but shows, also, how the earnings of 
the 210 workers are distributed throughout this range. 
There has been a considerable loss of detail, it will be noted. 
From this table we may learn that there are 48 persons who 
earned during the given week between $24.00 and $25.99, 
but we cannot learn how the earnings of the 48 individuals 
were distributed throughout this range of two dollars. All 
may have earned exactly $24.00, so far as we may know 
from the figures shown in the table. This loss of detail is 
an inevitable accompaniment of the condensation and 
simplification which the process of classification involves. 
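
The grouping process itself is mechanical and may be sketched
in a few lines. The Python example below classifies earnings
into equal class-intervals; it works in whole cents to avoid
rounding difficulties. The function name `frequency_table` is
hypothetical, and the sample consists only of the twenty-one
values shown in the first three rows of the array above, not
the full set of 210, so the frequencies it yields are not those
of Table 6.

```python
from collections import Counter

# First three rows of the array above, expressed in cents.
sample = [2255, 2515, 2615, 2675, 2745, 2795, 2860,
          2300, 2515, 2615, 2675, 2745, 2795, 2865,
          2300, 2515, 2625, 2680, 2750, 2800, 2870]


def frequency_table(earnings_cents, interval_cents, start_cents):
    """Count the items falling in each class of equal width.

    Returns (lower limit, upper limit, frequency) for every class from
    the starting point up to the class holding the largest item.
    """
    counts = Counter((value - start_cents) // interval_cents
                     for value in earnings_cents)
    table = []
    for k in range(max(counts) + 1):
        lower = start_cents + k * interval_cents
        upper = lower + interval_cents - 1      # e.g. $22.00 to $23.99
        table.append((lower / 100, upper / 100, counts.get(k, 0)))
    return table


two_dollar = frequency_table(sample, 200, 2200)
one_dollar = frequency_table(sample, 100, 2200)

for lower, upper, frequency in two_dollar:
    print(f"${lower:.2f} to ${upper:.2f}  {frequency}")
```

Halving the class-interval doubles the number of classes and
recovers some of the detail lost in the broader grouping, as
the comparison of `two_dollar` with `one_dollar` shows.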

If the size of the class-interval be decreased the loss of 
detail is less pronounced, though the increase in the number 
of classes means a more cumbersome table and one which 
presents a more complex picture to the eye. The tables 
which follow present the same data, classified with intervals 
of one dollar, fifty cents and twenty-five cents: 


66 THE ORGANIZATION OF STATISTICAL DATA 


TaBLe 7 
(Class-interval = $1) 


Weekly 
Earnings 


.00 to 
.00 to 
.00 to 
.00 to 
.00 to 
.00 to 
.00 to 
.00 to 
.00 to 
.00 to 


.00 to $22 
23. 
24. 
25. 
26. 
27. 
28. 
29. 


30 
31 
32 


.99 
99 
99 
99 
99 
99 
99 
99 
“99 
.99 
.99 


FreQuEency DISTRIBUTIONS OF EMPLOYEES 
Classified on the Basis of Weekly Earnings 


Fre- 
quency 


TABLE 8
(Class-interval = 50 cents)

Weekly               Fre-
Earnings             quency
$22.50 to $22.99       1
 23.00 to  23.49       4
 23.50 to  23.99       3
 24.00 to  24.49      11
 24.50 to  24.99      10
 25.00 to  25.49      12
 25.50 to  25.99      15
 26.00 to  26.49      22
 26.50 to  26.99      20
 27.00 to  27.49      24
 27.50 to  27.99      30
 28.00 to  28.49      17
 28.50 to  28.99      17
 29.00 to  29.49       7
 29.50 to  29.99       6
 30.00 to  30.49       5
 30.50 to  30.99       4
 31.00 to  31.49       1
 31.50 to  31.99       0
 32.00 to  32.49       1
                     210


TABLE 9
(Class-interval = 25 cents)

Weekly Earnings: 38 classes of 25 cents each, from $22.75 to $22.99
through $32.00 to $32.24. [Frequency column illegible.]



The four tables above (including Table 6) represent 
four different degrees of condensation of the same data. 
Tables 6, 7 and 8 present the same general characteristics: 
a small number of cases in the extreme classes and a more 
or less regular increase in the frequencies as the center of 
each of the distributions is approached. The departure 
from regularity becomes greater the greater the number of 
classes. Table 9, in which the class-interval is 25 cents, 
has 38 classes. In this table the distribution of cases 
throughout the range is highly irregular, with pronounced 
departures from symmetry. The structure of each of the 
other tables is orderly and approaches more closely a 
condition of symmetry. Each presents the wage data in 
condensed and compact form, so that one consulting the 
tables may learn of the size and distribution of weekly 
earnings in the factory in question much more readily 
than by reference to the chaotic collection of figures first 
shown. Such organized collections of data are termed 
frequency distributions, and their purpose, as the term 
implies, is to show in a condensed form the nature of the 
distribution of a variable quantity throughout the range 
covered by the values of the variable. The construction 
of such a table is the first step to be taken in the organization 
and analysis of quantitative data of the type represented 
above. 
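
The passage from one degree of condensation to another can be reproduced by merging adjacent classes. The following is a minimal sketch in Python, offered as an illustration and not as part of the procedure described in the text. It starts from the fifty-cent frequencies of Table 8 (the frequency of the $28.50 class is taken as 17, the value implied by the total of 210) and regroups them into one-dollar and two-dollar classes. The name `regroup` and the choice of cents as the unit are assumptions made for the example.

```python
# Table 8 frequencies keyed by lower class limit in cents (class-interval = 50 cents).
# The $28.50 class is taken as 17, the value implied by the total of 210.
TABLE_8 = {
    2250: 1, 2300: 4, 2350: 3, 2400: 11, 2450: 10,
    2500: 12, 2550: 15, 2600: 22, 2650: 20, 2700: 24,
    2750: 30, 2800: 17, 2850: 17, 2900: 7, 2950: 6,
    3000: 5, 3050: 4, 3100: 1, 3150: 0, 3200: 1,
}


def regroup(freqs, width, origin=2200):
    """Merge classes into wider classes of `width` cents, counted from `origin`."""
    merged = {}
    for lower, count in sorted(freqs.items()):
        new_lower = origin + (lower - origin) // width * width
        merged[new_lower] = merged.get(new_lower, 0) + count
    return merged


one_dollar = regroup(TABLE_8, 100)   # the classes of Table 7
two_dollar = regroup(TABLE_8, 200)   # the classes of Table 6

for lower, count in two_dollar.items():
    upper = lower + 200 - 1
    print(f"${lower / 100:.2f} to ${upper / 100:.2f}: {count}")
```

Merging can only move toward greater condensation. The reverse step, from two-dollar classes back to fifty-cent classes, is impossible without the original figures, which is the loss of detail noted above.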


STEPS IN THE CONSTRUCTION OF A FREQUENCY TABLE 


This general introduction to the subject of frequency 
tables has left untouched many important matters in con- 
nection with their construction. It remains to present a 
summary statement of these details. It will be clear that 
the first step here taken, the arrangement of the items in 
order of magnitude, is unnecessary in the actual construction
of such a table. Having determined the upper and
lower limits through an inspection of the data, one has
but to decide on the number of classes desired, write the
class-intervals on an appropriate blank sheet and proceed
to tally the cases falling in each of the classes thus set off. 
When this process is completed the frequencies are com- 
puted and the totals arranged in tabular form of the type 
illustrated above. These simple operations involve deci- 
sions on a number of points, however. 


SIZE OF CLASS-INTERVAL


In deciding upon the size of the class-interval (which is 
equivalent to deciding upon the number of classes) one 
fundamental consideration should be borne in mind, 
namely, that classes should be so arranged that there will 
be no material departure from an even distribution of cases 
within each class. This arrangement is necessary because 
of the fact that, in interpreting the frequency table and in 
subsequent calculations based upon it, the mid-value of 
each class is taken to represent the values of all cases falling 
in that class. Thus, in basing calculations upon Table 8, 
it is assumed that the 22 cases falling between $26.00 and 
$26.50 may all be represented by the mid-value of that 
class, $26.25. This assumption will seldom be strictly valid. 
In the case just cited reference to the original figures will 
show that it is not a correct assumption. Absolute accu- 
racy would only be obtained by having a class for every 
value represented in the original figures. Since condensa- 
tion is necessary an arrangement of classes should be 
secured which will minimize the error involved, without
transgressing other requirements. Table 6 furnishes an 
example of class-intervals too wide for the material. 
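
The effect of the mid-value assumption on a later calculation may be illustrated as follows. This is a sketch only, assuming the fifty-cent frequencies of Table 8 (with 17 taken for the $28.50 class, as implied by the total of 210) and a mid-value 25 cents above each lower limit, as in the $26.25 example. The function `grouped_mean` and its `position` parameter are hypothetical names introduced here. Placing all cases at the lower limit, and then at the upper limit, gives the bounds between which the true mean must lie.

```python
from fractions import Fraction

# (lower class limit in dollars, frequency) for the fifty-cent classes of Table 8.
# The $28.50 class is taken as 17, the value implied by the total of 210.
CLASSES = [
    ("22.50", 1), ("23.00", 4), ("23.50", 3), ("24.00", 11), ("24.50", 10),
    ("25.00", 12), ("25.50", 15), ("26.00", 22), ("26.50", 20), ("27.00", 24),
    ("27.50", 30), ("28.00", 17), ("28.50", 17), ("29.00", 7), ("29.50", 6),
    ("30.00", 5), ("30.50", 4), ("31.00", 1), ("31.50", 0), ("32.00", 1),
]
WIDTH = Fraction(1, 2)


def grouped_mean(classes, width, position=Fraction(1, 2)):
    """Mean computed as if every case lay at one point of its class.

    position = 1/2 places the cases at the mid-value; 0 and 1 place them
    at the lower and upper ends of the class, which bound the true mean.
    """
    n = sum(count for _, count in classes)
    total = sum((Fraction(lower) + position * width) * count
                for lower, count in classes)
    return total / n


mid = grouped_mean(CLASSES, WIDTH)
low = grouped_mean(CLASSES, WIDTH, Fraction(0))
high = grouped_mean(CLASSES, WIDTH, Fraction(1))
print(f"mid-value mean: {float(mid):.2f}")
print(f"possible range: {float(low):.2f} to {float(high):.2f}")
```

The width of the possible range equals the class-interval, which is one way of seeing why wide classes such as those of Table 6 weaken calculations based upon the table.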

The requirement which has just been described clearly 
calls for a large number of classes. A second requirement, 
which ordinarily conflicts with this, is that the number of 
classes should be so determined that an orderly and regular 
sequence of frequencies is secured. If the classification is 
too narrow for the data regularity will not be attained in 


THE FREQUENCY DISTRIBUTION 69 


this respect, and a table without structure or order will be 
secured. Table 9 fails to meet this requirement, as has 
been pointed out. It is desirable, also, that the number of 
classes be limited in order that the data may be easily 
manipulated and their significance readily grasped. As a 
general rule the class-intervals should be so adjusted that 
the number of classes is not less than 10 nor more than 25. 
The exact number which should be established in a given 
case will depend upon the nature of the data. In the 
above example Table 8, in which the class-interval is fifty 
cents, seems to conform most thoroughly to all these 
requirements. 
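
The rule of 10 to 25 classes can be checked mechanically for each candidate class-interval. The sketch below is an illustration under stated assumptions: the extreme earnings are taken as $22.75 and $32.00, values consistent with the extreme classes of Table 9 but not stated in the text, and class limits are assumed to fall at multiples of the class-interval. The function name `number_of_classes` is hypothetical.

```python
def number_of_classes(lowest, highest, width):
    """Classes needed when class limits fall at multiples of `width` (all in cents)."""
    return highest // width - lowest // width + 1


# Assumed extreme earnings in cents; any values inside the extreme
# classes of Table 9 ($22.75-$22.99 and $32.00-$32.24) give the same counts.
LOWEST, HIGHEST = 2275, 3200

counts = {width: number_of_classes(LOWEST, HIGHEST, width)
          for width in (200, 100, 50, 25)}
acceptable = [width for width, n in counts.items() if 10 <= n <= 25]

for width, n in counts.items():
    verdict = "within 10 to 25" if width in acceptable else "outside 10 to 25"
    print(f"class-interval ${width / 100:.2f}: {n} classes ({verdict})")
```

The rule alone admits both the one-dollar and the fifty-cent interval. The choice between them rests on the other requirements discussed above, namely even distribution within classes and a regular sequence of frequencies.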


LOCATION OF CLASS LIMITS


The location of class limits is a matter of considerable 
importance, for attention to this matter will simplify 
tabulation and facilitate later calculation. Tabulation of 
data is easiest when class limits are integers and the class- 
interval itself is a whole number. Calculation of averages 
and other statistical measures is facilitated when the mid- 
values of classes are integers. Suitable class limits and 
mid-points are usually secured when the data permit class- 
intervals of 5 or multiples of 5 to be employed, though 
such an arrangement is by no means essential. 

Some types of data show a tendency to cluster or con- 
centrate about certain values on the scale along which they 
are distributed. This is illustrated by the following figures 
which form part of a table showing the number of pieces 
of commercial paper discounted by the Federal Reserve 
Banks in 1921, distributed according to rates of discount 
or interest charged by member banks: 


Rate (per cent)    Number of pieces
6                      18,970
6¼                        697
6½                      4,616
6¾                        135
7                      17,362
7¼                         10




Here is a quite obvious bunching about the integers, with 
a secondary concentration at each half of one per cent. 
No cases at all fall between the quarter values here shown. 
It is clear that in classifying such data the mid-points of 
the various classes should fall at those values about which 
the cases are concentrated, and class limits must be located 
with this end in view. For, as noted above, calculations 
based upon the frequency table are performed upon the 
assumption that all the items in each class are concentrated 
at the mid-point of that class. Thus, if a class interval of 
one half of one per cent were selected in the above example, 
the classes should extend from 5¾ to (but not including)
6¼, 6¼ to 6¾, etc., rather than from 6 to 6½, 6½ to 7, etc.
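
The placing of class limits about the points of concentration may be sketched as follows. This is an illustration only. The rates in the table above are read as 6, 6¼, 6½, 6¾, 7 and 7¼ per cent, and the helper names `centred_classes` and `tally` are assumptions introduced for the example. Exact fractions are used so that a rate falling on a class limit is allocated without rounding error.

```python
from fractions import Fraction

# Rates (per cent) and number of pieces, as read from the table above.
PIECES = {
    Fraction(6): 18970, Fraction(25, 4): 697, Fraction(13, 2): 4616,
    Fraction(27, 4): 135, Fraction(7): 17362, Fraction(29, 4): 10,
}


def centred_classes(first_mid, width, n_classes):
    """Class limits (lower, upper) whose mid-points are first_mid, first_mid + width, ..."""
    half = width / 2
    return [(first_mid + k * width - half, first_mid + k * width + half)
            for k in range(n_classes)]


def tally(values, classes):
    """Count cases per class; each class includes its lower limit, excludes its upper."""
    return [sum(count for rate, count in values.items() if lower <= rate < upper)
            for lower, upper in classes]


classes = centred_classes(Fraction(6), Fraction(1, 2), 3)
totals = tally(PIECES, classes)
for (lower, upper), total in zip(classes, totals):
    print(f"{lower} to (but not including) {upper}: {total}")
```

With these limits the mid-points fall at 6, 6½ and 7 per cent, where the cases are concentrated. The 10 pieces at 7¼ per cent belong to the next class and are not counted in the three classes shown.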


ACCURACY OF OBSERVATIONS AND THE DEFINITION 
OF CLASSES 


In the construction of frequency tables it is essential 
that there be a clear definition of classes, so that there may 
be no uncertainty as to their range and no question as to 
the precise class in which a given case falls. A table with 
an arrangement similar to the following is sometimes en- 
countered: 


Class-interval Frequency 
0 to 10 
10 to 20 8 
20 to 30 15 
30 to 40 6 
40 to 50 Q 


In the absence of explanation, a question arises at once as
to whether a case with a value of 10 would fall in the first
or in the second class. It is highly desirable that the range 
of each class be indicated in some such way as the follow- 
ing, in order that this ambiguity may not arise: 


Class-interval Frequency 
0 to 9.9
10 to 19.9 8 
20 to 29.9 15 
30 to 39.9 6 
40 to 49.9 g 




This procedure solves the difficulty, however, only in case 
the observations are accurate to the nearest tenth. If the 
observations are accurate only to the nearest unit (that is, 
if the cases given a value of 10 actually lie between 9.5
and 10.5) a mere change in the description of the class- 
range does not solve the problem of allocating a case at 
the class limit. In such a case an observation falling at a 
class-boundary may be cut in two, one half being allocated 
to each of the adjacent classes. 

Yule¹ lays down the useful principle that in fixing a
class boundary the limit should be carried to a farther 
place in decimals, or a smaller fraction, than the values of 
the individual cases as originally recorded. Thus, in the 
preceding example, if observations were correct to the 
nearest tenth, it would mean that a value recorded as 9.9 
actually lay between 9.85 and 9.95. In accurately describ- 
ing the classes, therefore, the intervals should be given as 
0 to 9.95, 9.95 to 19.95, etc. It should be noted that the
values of the mid-points, with these class limits, would 
be 4.95, 14.95 etc. In presenting and using the table 
as given above the real meaning of the class limits 
should be borne in mind. In all cases class boundaries 
must be fixed with reference to the accuracy of the 
observations. 
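
Yule's principle lends itself to a short illustration. The sketch below assumes observations recorded to the nearest tenth and class boundaries at 9.95, 19.95, and so on, as in the example just given. The function names are hypothetical, and decimal arithmetic is used so that the boundaries are represented exactly. Because the boundaries are carried one place beyond the recorded values, no observation can fall on a boundary.

```python
from bisect import bisect_right
from decimal import Decimal

# Class boundaries carried one decimal place beyond the recorded values.
BOUNDARIES = [Decimal("9.95"), Decimal("19.95"), Decimal("29.95"),
              Decimal("39.95"), Decimal("49.95")]


def class_index(recorded):
    """Index of the class holding a value recorded to the nearest tenth.

    Index 0 is the class ending at 9.95, index 1 runs from 9.95 to 19.95, etc.
    """
    return bisect_right(BOUNDARIES, Decimal(recorded))


def mid_points(boundaries):
    """Mid-points of the classes lying between successive boundaries."""
    return [(lower + upper) / 2 for lower, upper in zip(boundaries, boundaries[1:])]


for recorded in ("9.9", "10.0", "19.9", "20.0"):
    print(recorded, "falls in class", class_index(recorded))
print("mid-points:", [str(m) for m in mid_points(BOUNDARIES)])
```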

The work of tabulation is simplified if, in designating a 
class, both limits are stated, as above. Errors are likely if 
only the lower limit of each class is given, or if the mid- 
point alone is designated. It is desirable, however,
particularly if calculations are to be based upon the table, to
include a separate column showing the values of the mid- 
points of the various classes. 


¹ An Introduction to the Theory of Statistics, 81.


OTHER REQUIREMENTS


Class-intervals should be uniform throughout the table
in order that all classes may be comparable. Occasionally
tables are published with varying class-intervals, so that
on one section of the scale the number of items falling 
within a class having an interval of 5 is given, and on 
another section of the scale the number of items falling 
within a class having a range of 10 is given. Obviously, 
comparison of classes is impossible. It may be desirable 
to show in more detail the cases falling within certain 
ranges on the scale, but this end is best achieved by the 
construction of a supplementary table relating only to the 
cases falling within this restricted section. The utility of 
the main table is not lessened thereby. 

Similar in nature is the requirement that there should be 
no indeterminate classes, that is, classes the ranges of which 
are not defined. Had all the individuals making $30.00 and 
over in the illustration of piece-work earnings been entered 
in a class with the designation “$30.00 and over,” the
upper limit of this class would have been quite uncertain. 
This fault in a table is a vital one when it is desired to base 
calculations upon the data contained in the table. When 
there are several extreme cases the inclusion of such classes 
is sometimes unavoidable, but when this is done the actual 
values of the cases included in such “open end” classes
should be given in a footnote to the table. 

The errors described in the two preceding paragraphs 
are exemplified in the table below. 


TABLE 10

Frequency Distribution of Employees, ABC Company
Classified on the Basis of Daily Wages

                      Number of workers in each class
Wage group                     (frequency)
Less than $3.00                    15
$3.00 to  3.99                     30
 4.00 to  4.99                     40
 5.00 to  6.49                     30
 6.50 to  8.99                     10
 9.00 and over                      5




In this case the ranges of the two “open-end” classes are
not known. The ranges of the intermediate classes vary,
being $1.00 for two classes, $1.50 for one class, and $2.50
for one class. Such a table is of little value. 
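
Both faults can be detected mechanically before a table is used as a basis for calculation. The following sketch is illustrative only: the representation of Table 10 as triples, the use of `None` for an undefined limit, and the function `check_classes` are all assumptions of the example. Wages are taken as recorded to the nearest cent, so that a class from $3.00 to $3.99 has a range of $1.00.

```python
from decimal import Decimal

# Table 10 as (lower limit, upper limit, frequency); None marks an undefined limit.
TABLE_10 = [
    (None, "2.99", 15),
    ("3.00", "3.99", 30),
    ("4.00", "4.99", 40),
    ("5.00", "6.49", 30),
    ("6.50", "8.99", 10),
    ("9.00", None, 5),
]


def check_classes(classes, unit=Decimal("0.01")):
    """Return (indices of open-end classes, distinct widths of the closed classes)."""
    open_end = [i for i, (lower, upper, _) in enumerate(classes)
                if lower is None or upper is None]
    widths = {Decimal(upper) - Decimal(lower) + unit
              for lower, upper, _ in classes
              if lower is not None and upper is not None}
    return open_end, sorted(widths)


open_end, widths = check_classes(TABLE_10)
print("open-end classes:", open_end)
print("class-intervals found:", [str(w) for w in widths])
print("uniform:", len(widths) == 1 and not open_end)
```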


THE STRUCTURE OF STATISTICAL TABLES


The preceding discussion has been confined to certain 
more or less technical problems which arise in the con- 
struction of a frequency table. Nothing has been said di- 
rectly as to the form of the completed table, the arrangement 
of columns and rows, the title, the notation. No general 
principles of tabular arrangement have been laid down. 
While no detailed treatment of these principles is possible 
within the scope of the present discussion, certain general 
considerations relating to the structure of statistical tables 
may be suggested. 

The statistical table is merely a device for presenting in 
summary fashion a mass of quantitative data. Unless the 
summary be clear, significant, concise and readily inter- 
preted nothing has been gained by the process of tabulation 
and classification. A sprawling, formless table is like a 
rambling, unintelligible discourse. There must be a purpose 
in back of each table, and this purpose should be clearly 
brought out in its arrangement. The means by which this 
purpose may be attained in a given case must be determined
with reference to the specific conditions affecting that
case, but standard practices should be followed, in so far 
as possible. The following general principles will be found 
helpful in deciding upon the form and arrangement of 
statistical tables: 


1. The title should constitute a clear, concise and complete
description of the material assembled in the table.
2. Headings of columns and rows should be concise and
unambiguous.
3. Variable quantities should increase from left to right and
from top to bottom, when such arrangement is feasible.
4. Columns and rows may be numbered to facilitate reference
to the table.
5. The units of measurement employed should be clearly
indicated.
6. Sources should be given in all cases.
7. The table should constitute a unit, self-sufficient and
self-explanatory. All explanations necessary for its
interpretation should be included as integral parts of the table, or
in the form of footnotes.


GRAPHIC REPRESENTATION OF FREQUENCY 
DISTRIBUTIONS 


Frequency distributions of the type illustrated above
serve a very important statistical function in presenting a
compact summary of data, and in preparing these data for
further manipulation. Such distributions may be presented
not only in tabular form, but graphically, utilizing
the general principles of the coordinate system which were
explained above. Many of the characteristic features of a
frequency distribution are most clearly revealed when the
graphic method is adopted.

FIG. 22. Column Diagram: Distribution of 210 Employees Classified on the
Basis of Weekly Earnings. (Class-interval = $2.00)

Table 6, presenting the weekly earnings of 210 employees,
with a class-interval of two dollars, is depicted
graphically in Fig. 22. In this figure class-intervals are
plotted along the x-axis and the corresponding class-frequencies
along the y-axis, appropriate scales being selected.
The fact should be noted that the scale of abscissas starts not
with zero, but with $20. For convenience in presentation
that part of the scale extending from 0 to $20 is omitted.
The student should bear this in mind in seeking to secure
a correct impression of the relations between the two
variables plotted. In constructing such a figure, which is
termed a column diagram or histogram, short horizontal
lines are drawn connecting the points plotted to represent
the upper and lower limits of each class-interval. In
interpreting this diagram it should be noted that the areas of
the different rectangles are proportional to the number of
cases represented, the total area representing the entire 210
cases. This device thus presents to the eye a very clear
picture of the distribution, showing quite unmistakably
the relative number of workers falling in each of the wage
classes.

FIG. 23. Column Diagram: Distribution of 210 Employees Classified on the
Basis of Weekly Earnings. (Class-interval = $1.00)

FIG. 24. Column Diagram: Distribution of 210 Employees Classified on the
Basis of Weekly Earnings. (Class-interval = $.50)

The classes in this case are so large, however, that some
violence is done to the facts. So many details are lost
that a true conception of the disposition of the items is not
given. Fig. 23 is a histogram depicting the distribution of
cases when a class-interval of one dollar is used. In this
case, with smaller steps, we approach more closely an orderly
and symmetrical distribution. The same is true of Fig. 24
which shows the distribution when the class-interval is
fifty cents. The distribution represented in Fig. 25 has a
class-interval of twenty-five cents which, as has been
pointed out, is too narrow for the data, with the result
that a quite irregular structure is secured. (It should be
noted that the vertical scale is not the same in these four
figures, so that comparison with respect to class-frequencies
is only possible by reference to the scale figures.)

FIG. 25. Column Diagram: Distribution of 210 Employees Classified on the
Basis of Weekly Earnings. (Class-interval = $.25)
Frequency polygons corresponding to the histograms of
Figs. 22, 23, and 24 are shown in Figs. 26, 27, and 28.

FIG. 26. Frequency Polygon: Distribution of 210 Employees Classified on the
Basis of Weekly Earnings. (Class-interval = $2.00)

FIG. 27. Frequency Polygon: Distribution of 210 Employees Classified on the
Basis of Weekly Earnings. (Class-interval = $1.00)

FIG. 28. Frequency Polygon: Distribution of 210 Employees Classified on the
Basis of Weekly Earnings. (Class-interval = $.50)

Each of these polygons has been constructed by plotting as
abscissas the mid-points of the class-intervals, and as
ordinates the class-frequencies, the points thus secured being
connected by a broken line. In completing such a figure
the class next below the lowest one on the scale and the
class next above the highest one on the scale are included,
the class-frequency being zero in each case. The ends of
the polygon thus connect with the base line at the
mid-points of these two extra classes. In the case of the
frequency polygon the entire area under the curve represents
the entire number of cases, but the area of a given interval
cannot be taken to be proportional to the number of cases
in that interval, because of irregularities in the distribution
on either side of the given class. The heights of the
ordinates at the mid-points of the various classes are, of course,
scaled to represent the class-frequencies.
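
The construction of the polygon, and the two statements about its area, can be verified numerically. The sketch below is an illustration, not the author's procedure. It uses one-dollar class frequencies obtained by summing pairs of the fifty-cent classes of Table 8, and the helper names `polygon_points` and `area_between` are assumptions. The total area is measured in units of one case per dollar of class-interval.

```python
# One-dollar classes: lower limits and frequencies (pairs of Table 8 classes summed).
LOWER_LIMITS = list(range(22, 33))
FREQUENCIES = [1, 7, 21, 27, 42, 54, 34, 13, 9, 1, 1]
WIDTH = 1.0


def polygon_points(lower_limits, freqs, width):
    """Mid-points and frequencies, with a zero class added at each end."""
    mids = [lower + width / 2 for lower in lower_limits]
    xs = [mids[0] - width] + mids + [mids[-1] + width]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))


def area_between(points, left, right):
    """Area under the broken line between abscissas `left` and `right`."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        a, b = max(x0, left), min(x1, right)
        if a < b:
            slope = (y1 - y0) / (x1 - x0)
            ya, yb = y0 + slope * (a - x0), y0 + slope * (b - x0)
            area += (b - a) * (ya + yb) / 2
    return area


points = polygon_points(LOWER_LIMITS, FREQUENCIES, WIDTH)
total_area = area_between(points, points[0][0], points[-1][0])
modal_class_area = area_between(points, 27.0, 28.0)
print(total_area, modal_class_area)
```

The whole area corresponds to the 210 cases, but the area standing over the class $27.00 to $27.99 falls short of the 54 cases in that class, since the lower frequencies of the neighboring classes pull the broken line down on either side.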


THE SMOOTHING OF CURVES 


Attention is again called to the results secured with 
varying class-intervals. As the class-interval is decreased, 
up to a certain point, the histograms and polygons become 
smoother and more regular. Beyond that point breaks 
begin to appear in the data; the regular change in class- 
frequencies which was found when the classes were larger 
is broken by the appearance of irregular classes which 
seem to depart from the general rule. In Fig. 25 these 
have become quite pronounced. Such irregularities, it is 
obvious, are exceptions to a general rule which seems to 
prevail, the general rule that the numbers of workers falling 
within the different wage classes increase from the lower 
limit of earnings up to a maximum in the neighborhood of 
$27.50, and then decrease till, at the upper limit of $32, 
but one worker is found. Since all the 210 individuals are 
engaged in the same work, and since their earnings depend 
only upon their rapidity and skill, one would expect a quite 
regular increase and decrease. If we had figures not for 
one week only, but for 52 weeks, and took the average 
weekly earnings of each of the 210 workers for the year, 
we should expect greater regularity with the smaller class- 
intervals than is actually found, since the accidental fluc- 
tuations peculiar to one week alone would thus be elimi- 
nated. Or, if we had earnings during one week for 10,920 
workers (52 times 210), the same result would be secured.
Thus, if regularity and smoothness are to be secured, it is
essential not only to decrease the size of the classes but 
also to increase the number of cases, in order that the 
accidental irregularities which affect a small number of 
observations may be eliminated. A refined classification 
with a small number of cases leads to the condition exempli- 
fied in Fig. 25. But such an increase in the number of 
cases is, in general, a practical impossibility. We wish, if 
possible, to develop a feasible method of approximating 
the distribution which would be secured with very small 
class-intervals and a very large number of cases. Such an 
approximation is possible through the device of curve- 
smoothing. By this method we may secure a smooth 
frequency curve which lacks the irregularities occasioned by 
minor fluctuations. 

Such a smooth frequency curve serves to represent the 
true underlying distribution of the data. It was pointed 
out that areas in the frequency polygon are not proportional
to the number of cases included, the cause lying in
the irregularities of the data. In a smoothed frequency 
curve these irregularities have been eliminated, and the 
area between ordinates erected at given points on the scale 
of abscissas is assumed to be proportional to the theoretical 
frequency of cases between the given values. Moreover, a 
smooth trend having been established, frequencies for in- 
termediate values not shown in the original table may be 
determined by interpolation.¹

The following data,² representing the distribution in 1918
of personal incomes below $4000, will serve to exemplify 
the smoothing process. 

¹ The limitations of practical statistical work are such that there must of
necessity be many gaps in the data. The given values of the variables are not
continuous. Interpolation is the process of estimating values of a variable quantity
between given values, or of locating a point on a curve between given points.
That interpolation is most accurate which leads to estimated values having the
highest degree of consistency with the given values.
² From Vol. I, Income in the United States, National Bureau of Economic
Research. New York, Harcourt, Brace & Co., 1921, 132-33.




TABLE 11 


Distribution of Income Among Personal Income Recipients in 1918 
(Including all personal incomes below $4000) 


Income Class ¹ Number of Persons ²
$0 to $100 62,809
100 to 200 103,704 
200 to 300 209,087 
300 to 400 489,963 
400 to 500 961,991 
500 to 600 1,549,974 
600 to 700 2,154,474 
700 to 800 2,668,466 
800 to 900 3,013,034 
900 to 1,000 3,144,722 
1,000 to 1,100 3,074,351 
1,100 to 1,200 2,850,526 
1,200 to 1,300 2,535,285 
1,300 to 1,400 2,205,728 
1,400 to 1,500 1,832,230 
1,500 to 1,600 1,512,649 
1,600 to 1,700 1,234,397 
1,700 to 1,800 999,996 
1,800 to 1,900 811,236 
1,900 to 2,000 663,789 
2,000 to 2,100 549,787 
2,100 to 2,200 463,222 
2,200 to 2,300 395,115 
2,300 to 2,400 340,141 
2,400 to 2,500 295,490
2,500 to 2,600 258,650 
2,600 to 2,700 227,731 
2,700 to 2,800 201,488 
2,800 to 2,900 178,901 
2,900 to 3,000 154,499 
3,000 to 3,100 142,802 
3,100 to 3,200 128,217 
3,200 to 3,300 115,583 
3,300 to 3,400 104,504 
3,400 to 3,500 94,803 
3,500 to 3,600 86,405 
3,600 to 3,700 79,023 
3,700 to 3,800 72,562 
3,800 to 3,900 66,900 
3,900 to 4,000 61,894 


¹ The definition of classes used is equivalent to “$0 to and not including $100,”
etc. Thus an individual with an income of $100 would fall in the second class.

² The Bureau’s report states “The numbers below are given to the nearest
unit. It is not pretended that such arithmetic accuracy is anything more than
technical.”




Figures 29, 30, and 31 present column diagrams of these
income data, grouped with class intervals of $500, $200 and
$100.

FIG. 29. Column Diagram: Distribution of Personal Income Recipients in
the United States, 1918. Including all Recipients of Incomes below $4,000
(Class-interval = $500)

As the class-interval is decreased the histograms become
more regular and uniform, but our original data
permit us to carry this process only to the point where the
class-interval is $100. Our problem is to determine the
underlying distribution which the data approximate more
and more closely as the class-interval is lessened. If
we replace the broken line of the histogram by a smooth
curve enclosing the same total area as the histogram and so
drawn through the points of the histogram that the area
cut from each rectangle is approximately equal to the area
added to the same rectangle by the curve, we will have a
frequency curve representing the desired distribution. The
requirement that the same total area be enclosed is
fundamental. Exceptions to the rule concerning the area of
individual rectangles will frequently occur because of the
existence of quite irregular classes, but as a general
working principle it is helpful. (More refined methods of fitting
a smooth curve to data will be discussed at a later point,
but a process of smoothing by inspection such as that
described above gives a fairly close approximation to the
required curve.)
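
Smoothing by inspection is a freehand process and cannot be reduced to a formula, but the fundamental requirement, that the total area be unchanged, can be illustrated with a mechanical substitute. The sketch below applies a weighted moving average with weights 1/4, 1/2, 1/4. This is a different technique from the one described in the text, introduced here only for illustration. It is applied to the fifty-cent wage frequencies of Table 8 (with 17 taken for the $28.50 class) rather than to the income data, since the wage figures show the irregularities in question. The function name `smooth_121` is an assumption.

```python
from fractions import Fraction

# Fifty-cent class frequencies of Table 8 (the $28.50 class taken as 17,
# the value implied by the total of 210).
FREQUENCIES = [1, 4, 3, 11, 10, 12, 15, 22, 20, 24,
               30, 17, 17, 7, 6, 5, 4, 1, 0, 1]


def smooth_121(freqs):
    """Weighted moving average with weights 1/4, 1/2, 1/4.

    One empty class is added at each end of the result, so that the share of
    the extreme classes pushed outward is kept and the total is unchanged.
    """
    padded = [0, 0] + list(freqs) + [0, 0]
    return [Fraction(padded[i - 1] + 2 * padded[i] + padded[i + 1], 4)
            for i in range(1, len(padded) - 1)]


smoothed = smooth_121(FREQUENCIES)
print([float(value) for value in smoothed])
print("total before:", sum(FREQUENCIES), "total after:", sum(smoothed))
```

Each class gives up part of its frequency to its neighbors and receives part of theirs, which corresponds roughly to the rule that the area cut from a rectangle should be balanced by area added elsewhere.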

Figure 32 illustrates the result of smoothing the
histogram of income distribution shown in Fig. 31. Here the
quite artificial jumps between income classes are smoothed
out, and we secure the graduation by infinitesimal
increments which we should expect to find when the incomes of
so many millions of persons are included. Here we have
that which we desired, an approximation to the true
underlying distribution, with the sharp breaks resulting
from the method of classification eliminated.

FIG. 30. Column Diagram: Distribution of Personal Income Recipients in
the United States, 1918. Including all Recipients of Incomes below $4,000
(Class-interval = $200)




CONTINUOUS AND DISCRETE SERIES


The logical validity of the smoothing process is
dependent upon the nature of the data being manipulated.
From this point of view frequency series of the type
discussed above may be divided into two classes, continuous
series and non-continuous or discrete series. A continuous
series is one in which the values of the independent variable
increase or decrease by increments which are infinitely
small. A discrete series is one in which the phenomena
represented by the independent variable always change in
value by definite amounts. The curve of underlying values
rises not smoothly, as for the continuous series, but by
jumps.

FIG. 31. Column Diagram: Distribution of Personal Income Recipients in
the United States, 1918. Including all Recipients of Incomes below $4,000
(Class-interval = $100)

The fact should be emphasized that in making this
distinction we are speaking of the values as they would be
found in the underlying universe of phenomena from which
the actual bodies of material we study are drawn. Any
given sample, whether representing continuous or discrete
series, will be marked by breaks in the values of the
independent variable. This will be true, in the case of a
continuous series, because of the limitations upon the
instruments and senses we use in measuring. Thus if the heights
of 100 men be measured, the independent variable of the
frequency series (height) will increase by finite amounts.
We may measure to the nearest inch, or perhaps to the
nearest eighth of an inch. Yet if ten thousand or ten
million men were arranged in order of height the differences
between successive individuals would be infinitely small.
Height is a continuous variable, even though the values
found in a given sample are marked by discontinuity.
Quite different is the distribution of such a variable as
interest or discount rates. If one were to secure 100 such
quotations and rank them in the order of size the
variations would be discontinuous, as in the sample of men
whose heights were measured. But in the case of heights
the underlying values, if they could be determined for a
large population, would be marked by continuous
variation, whereas, were an infinite number of discount rate
quotations secured, there would still be breaks in the
sequence. Discount rates increase or decrease by one quarter
or one half of one per cent, not by infinitesimal amounts.
Such a series is termed discrete, or non-continuous.

FIG. 32. Frequency Curve: Distribution of Personal Income Recipients in
the United States, 1918. Including all Recipients of Incomes below $4,000
(Derived from the column diagram with class-interval of $100)

The smoothing process provides a means of securing an 
approximation to the distribution of values as they would 
be found if a sample could be increased indefinitely in size. 
It is based upon the assumption that the irregularities 
found in the sample actually studied are accidental, and 
that the underlying values would show continuous and un- 
broken variation. Obviously, therefore, it is only fully 
justified when applied to a continuous series. A histogram 
of human heights may be smoothed in order to secure a 
representation of the true underlying distribution in the 
population at large, and interpolation based upon this 
smoothing process is valid. But smoothing is quite illogical 
for a markedly discontinuous series. It would be meaning- 
less to construct a smooth curve showing the distribution 
of discount rates for the purpose of securing the theoretical 
frequency of a rate of 4.3675 per cent. In practical sta- 
tistical work, however, it is frequently helpful to handle 
discrete series as though they were continuous, and in these 
cases the smoothing device may be employed. But in the 
interpretation and use of the smoothed curve the important 
logical distinction between continuous and discontinuous 
variation should be kept clearly in mind.¹


¹ For a fuller discussion of the points discussed in this section see G. H. Knibbs,
“The Theory and Justification of Curve Smoothing,” in H. Secrist, Readings and 
Problems in Statistical Methods. N. Y., Macmillan, 1920, 278-82. 




CUMULATIVE ARRANGEMENT OF STATISTICAL DATA 


For certain purposes it is desirable to arrange data 
cumulatively, rather than in separate and exclusive classes 
of the type illustrated in the frequency tables presented 
above. The following material will illustrate some of the 
advantages of this arrangement. 

In a study of the durability of telephone poles1 these
results were secured: 


TABLE 12

Frequency Distribution of 248,707 Telephone Poles, Classified According
to Length of Life

Length of Life     Number of Poles
   (years)           (frequency)
   0- 0.9               1,150
   1- 1.9               4,221
   2- 2.9              10,692
   3- 3.9              13,966
   4- 4.9              16,633
   5- 5.9              18,211
   6- 6.9              19,011
   7- 7.9              19,260
   8- 8.9              20,909
   9- 9.9              19,879
  10-10.9              20,764
  11-11.9              15,454
  12-12.9              14,237
  13-13.9              13,779
  14-14.9               9,764
  15-15.9               8,534
  16-16.9               7,659
  17-17.9               6,918
  18-18.9               4,591
  19-19.9               1,798
  20-20.9                 815
  21-21.9                 313
  22-22.9                 102
  23-23.9                  47


The table shows that 1,150 poles were scrapped during 
the first year of use, that 4,221 were scrapped after reaching 
1 “Replacement Insurance,” Edwin Kurtz. Administration, July, 1921, 41-69.




THE FREQUENCY DISTRIBUTION 89 


the age of one year and before reaching the age of two 
years, and so on. This is simply a frequency table of the 
ordinary type. A much more significant arrangement for 
many purposes is secured when the figures are assembled 
cumulatively, as in the following table. 


TABLE 13

Cumulative Distribution of 248,707 Telephone Poles, Classified
According to Length of Life
(Cumulated upward)

                       Number of Poles Surviving
Length of Life               (frequency)
Less than  1 year               1,150
Less than  2 years              5,371
Less than  3 years             16,063
Less than  4 years             30,029
Less than  5 years             46,662
Less than  6 years             64,873
Less than  7 years             83,884
Less than  8 years            103,144
Less than  9 years            124,053
Less than 10 years            143,932
Less than 11 years            164,696
Less than 12 years            180,150
Less than 13 years            194,387
Less than 14 years            208,166
Less than 15 years            217,930
Less than 16 years            226,464
Less than 17 years            234,123
Less than 18 years            241,041
Less than 19 years            245,632
Less than 20 years            247,430
Less than 21 years            248,245
Less than 22 years            248,558
Less than 23 years            248,660
Less than 24 years            248,707
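
The upward cumulation of Table 13 amounts to keeping a running total of the class frequencies of Table 12. A minimal sketch in Python follows; the language and the variable names are illustrative assumptions and form no part of the original study.

```python
from itertools import accumulate

# Class frequencies from Table 12: poles scrapped at ages
# 0-0.9, 1-1.9, ... 23-23.9 years.
frequencies = [
    1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
    20909, 19879, 20764, 15454, 14237, 13779, 9764, 8534,
    7659, 6918, 4591, 1798, 815, 313, 102, 47,
]

# Cumulating upward: each entry is the sum of its own class
# and of all classes below it on the scale.
less_than = list(accumulate(frequencies))

for upper_limit, count in enumerate(less_than, start=1):
    print(f"Less than {upper_limit:2d} years: {count:>7,}")
```

The last running total must equal the number of cases in the whole distribution, which serves as a check upon the arithmetic.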


It is important to note that it is possible to cumulate a
frequency series in two different ways. From the above 
table we may determine readily the number failing to 
attain any given age. It is often more convenient to re- 
verse the process, so that the table will enable the total 




number above any given value to be immediately deter- 
mined. When the telephone pole figures are thus cumulated 
downward the following table is secured. 


TABLE 14

Cumulative Distribution of 248,707 Telephone Poles, Classified
According to Length of Life
(Cumulated downward)

      (1)                     (2)                  (3)
                    Number of Poles Surviving
 Length of Life           (frequency)            Per cent
 0 and more                 248,707               100.0
 1 year and more            247,557                99.5
 2 years and more           243,336                97.8
 3 years and more           232,644                93.6
 4 years and more           218,678                88.0
 5 years and more           202,045                81.2
 6 years and more           183,834                73.8
 7 years and more           164,823                66.3
 8 years and more           145,563                58.5
 9 years and more           124,654                50.1
10 years and more           104,775                42.1
11 years and more            84,011                33.8
12 years and more            68,557                27.6
13 years and more            54,320                21.8
14 years and more            40,541                16.3
15 years and more            30,777                12.4
16 years and more            22,243                 8.9
17 years and more            14,584                 5.9
18 years and more             7,666                 3.1
19 years and more             3,075                 1.2
20 years and more             1,277                 0.5
21 years and more               462                 0.2
22 years and more               149                 0.06
23 years and more                47                 0.02
24 years and more                 0                 0.00

Cumulative tables such as those given above have
distinct advantages in handling many types of data. Life
tables are generally presented in this form. The scientific
study of depreciation will lead to the construction of
elaborate “mortality tables” for various types of equipment,
and these will be most useful in the cumulative



form. It is frequently desirable to reduce the frequencies 
to percentages, as in column (3) of Table 14, though it 
should not be forgotten that the significance of the per- 
centages depends upon the absolute numbers upon which 
they are based. 
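
The downward cumulation and the reduction to percentages may be sketched in the same way. The Python below is an illustration only, under the assumption that the class frequencies are those of Table 12; the names used are hypothetical. Computed percentages may differ in the last decimal place from the printed ones, since the rounding rule of the original tabulation is not known.

```python
# Class frequencies from Table 12 (ages 0-0.9, 1-1.9, ... 23-23.9 years).
frequencies = [
    1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
    20909, 19879, 20764, 15454, 14237, 13779, 9764, 8534,
    7659, 6918, 4591, 1798, 815, 313, 102, 47,
]

total = sum(frequencies)

# Cumulating downward: the number lasting `age` years or more is the
# total less all poles scrapped before reaching that age.
surviving = [total - sum(frequencies[:age]) for age in range(len(frequencies) + 1)]

# Each absolute number expressed as a percentage of the whole.
percent = [100 * count / total for count in surviving]

for age, (count, pct) in enumerate(zip(surviving, percent)):
    print(f"{age:2d} years and more: {count:>7,}  {pct:6.2f} per cent")
```

Printing the absolute numbers beside the percentages keeps before the reader the base upon which each percentage rests.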


THE OGIVE, OR CUMULATIVE FREQUENCY CURVE


The general utility of such cumulated data is limited by 
the classification system necessarily adopted in condensing 


Fig. 33. Cumulative Frequency Curve: Distribution of Telephone Poles
Classified according to Length of Life. (Cumulated upward)
[Horizontal axis: Length of Life in Years]


the material. Unless we interpolate mathematically we are 
limited to the points on the scale actually noted in the two 
tables. For this reason, a generalized cumulative curve 
similar to the smoothed frequency curve described in the 
preceding section is desirable. If the values given in 
Table 13 be plotted on coordinate paper (the length of life




in each case as abscissa, and the corresponding number of 
poles as ordinate) and a smooth curve drawn through the 
points thus plotted, the cumulative frequency curve shown 
in Fig. 33 is secured. In Fig. 34 the data of Table 14 are 
plotted. 

Such a curve constitutes one of the most effective and 
useful representations of a frequency series. It is obvious 


Fig. 34. Cumulative Frequency Curve: Distribution of Telephone Poles
Classified according to Length of Life. (Cumulated downward)
[Vertical axis: Number of Poles. Horizontal axis: Length of Life in Years]

that the limitations of the particular class-interval adopted 
are in large part removed; the shape of the curve will be 
fundamentally the same, though the class-interval and 
number of classes may vary. Frequency curves of the 
usual type may not be compared unless the groupings are 
the same, but cumulative frequency curves are subject to 
no such restriction. Moreover, uneven class-intervals do not 
distort the ogive, or cumulative curve, as they do the 
ordinary frequency curve. 
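
That the ogive does not depend upon the particular class-interval may be verified with the telephone pole figures. In the sketch below (Python; an illustration only, with hypothetical names) the one-year classes of Table 12 are merged into two-year classes, and every point of the coarser cumulative series is found to lie upon the finer one.

```python
from itertools import accumulate

# Class frequencies from Table 12 (one-year classes, ages 0 to 23.9).
frequencies = [
    1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
    20909, 19879, 20764, 15454, 14237, 13779, 9764, 8534,
    7659, 6918, 4591, 1798, 815, 313, 102, 47,
]

# Cumulative totals at ages 1, 2, 3, ... 24 years.
one_year_ogive = list(accumulate(frequencies))

# Regroup into two-year classes: 0-1.9, 2-3.9, ... 22-23.9.
two_year_classes = [
    frequencies[i] + frequencies[i + 1] for i in range(0, len(frequencies), 2)
]

# Cumulative totals at ages 2, 4, 6, ... 24 years.
two_year_ogive = list(accumulate(two_year_classes))

# The coarser series coincides with every second point of the finer one.
assert two_year_ogive == one_year_ogive[1::2]
print(two_year_ogive[:3])
```

The ordinary frequencies, by contrast, are roughly doubled by the regrouping, so that the two frequency tables could not be compared directly.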




The cumulative curve is particularly well adapted to
interpolation. Thus if it is desired to know the number of
poles surviving less than 15½ years, the value of the ordinate
of the curve having 15½ as abscissa may be approximated
from Fig. 33. A value of 222,000 is secured. If the
number surviving 8½ years or more is desired, a similar
estimate may be made from Fig. 34. The interpolated
figure in this case is 135,000.

Another type of interpolation possible with such a curve
is the determination of the number of cases falling within
any given interval. One is not limited to the class-intervals
marked out in the original tables. For instance, it may be
desirable to know the number of poles surviving more than
10½ but less than 15 years. Reading from the table or
from the chart we find that 217,930 poles survived less
than 15 years. Interpolating on the chart in the manner
described above a figure of 154,000 is secured for the number
surviving less than 10½ years. Subtracting the latter figure
from the former we have 63,930 as the number of poles
falling within the 10½ to 15 years interval. The figure is,
of course, an approximation to the true value, as are all
values secured through such smoothing and interpolation.
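
These readings may be checked roughly by arithmetic. The sketch below (Python; an illustration only) replaces the freehand smooth curve by straight lines joining the tabulated points, which is an assumption of this example and not the graphic method of the text. The results therefore agree with the chart readings only approximately. The function names are hypothetical, and ages are assumed to be zero or greater.

```python
from itertools import accumulate

# Class frequencies from Table 12 (one-year classes, ages 0 to 23.9).
frequencies = [
    1150, 4221, 10692, 13966, 16633, 18211, 19011, 19260,
    20909, 19879, 20764, 15454, 14237, 13779, 9764, 8534,
    7659, 6918, 4591, 1798, 815, 313, 102, 47,
]

# cumulative[a] = number of poles lasting less than a years (a = 0 ... 24).
cumulative = [0] + list(accumulate(frequencies))


def less_than(age):
    """Poles lasting less than `age` years, by straight-line interpolation."""
    lower = int(age)
    if lower >= len(frequencies):
        return float(cumulative[-1])
    fraction = age - lower
    return cumulative[lower] + fraction * frequencies[lower]


def between(low, high):
    """Poles lasting at least `low` but less than `high` years."""
    return less_than(high) - less_than(low)


print(less_than(15.5))                    # compare 222,000 read from Fig. 33
print(cumulative[-1] - less_than(8.5))    # compare 135,000 read from Fig. 34
print(between(10.5, 15))                  # compare 63,930 obtained in the text
```

Straight-line interpolation assumes that the cases are spread evenly within each class; a smooth curve makes no such assumption, which accounts for the small differences between the two sets of figures.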

It should be noted that the ogive may be derived directly 
from the array, without the formation of a frequency table 
as an intermediate step. This curve, in fact, may be looked 
upon as merely a graphic representation of the array. It 
represents one of the simplest forms of statistical organi- 
zation, as well as one of the most effective methods of 
manipulating quantitative data. 


RELATION BETWEEN THE OGIVE AND THE FREQUENCY 
CURVE 


The ogive and the frequency curve are merely two
different arrangements of precisely the same material, each 
arrangement having certain distinctive advantages. The 




Fig. 35. Distribution of Sawmills in the United States Classified according to
Labor Cost in 1920. Illustrating the Structural Relation between the Ogive
and the Frequency Curve.
[Upper panel: Cumulative Frequency. Lower panel: Frequency. Horizontal axis:
Labor Cost per 1000 feet]


characteristics of each may be more clearly apparent if the
structural relationship between these two curves is understood.
This relationship is graphically portrayed in
Fig. 35.




This figure is based upon the following frequency table, 
showing the distribution of sawmills in the United States, 
classified on the basis of labor cost per 1,000 feet of lumber 
produced.1


TABLE 15

Frequency Distribution of 269 Sawmills in the United States
Classified According to Labor Cost in 1921

Labor cost (all employees) per     Number of establishments
 1,000 feet, board measure               (frequency)
       $1.00-$1.49                            3
        1.50- 1.99                           10
        2.00- 2.49                           14
        2.50- 2.99                           22
        3.00- 3.49                           38
        3.50- 3.99                           40
        4.00- 4.49                           38
        4.50- 4.99                           33
        5.00- 5.49                           20
        5.50- 5.99                           11
        6.00- 6.49                           10
        6.50- 6.99                           11
        7.00- 7.49                            8
        7.50- 7.99                            4
        8.00- 8.49                            4
        8.50- 8.99                            3

        Total                               269


The upper part of Fig. 35 indicates the method by which 
the ogive is built up. Just as in the histogram, the area of 
each rectangle is proportional to the number of cases falling 
in the given class. Since the operation is a cumulative 
one, however, the base of each rectangle is the cumulated 
frequencies of all preceding classes. Thus the y-value (frequency)
of the first rectangle is 3, erected from zero as a
base, the y-value of the second class is 10, erected from 3 
as a base, and so on. The slope of the curve connecting 


1 From “Labor Efficiency and Productiveness in Sawmills,” Ethelbert Stewart, 
Monthly Labor Review, January, 1923, 14. Seven scattered cases above $9.00 in 
value have been omitted from the table and the accompanying graph. 




these rectangles is gradual at first when the frequencies 
are low, then steeper as the frequencies become greater, 
and finally tapers off as the frequencies decrease near the 
upper limit of the distribution. This is the cumulative 
frequency curve, or ogive. 

When the various rectangles representing the class- 
frequencies are dropped to the zero line as a common base, 
the x-values remaining the same throughout, the histogram 
or column diagram described in an earlier section is secured. 
From this the frequency polygon or smoothed frequency 
curve may be derived. 
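
The structural relation between the two curves may be put in arithmetic form. In the sketch below (Python; an illustration only, with hypothetical names) each rectangle of the ogive is given a base equal to the cumulated frequencies of all preceding classes and a top equal to that base plus its own frequency. Dropping the rectangles to a common zero base, that is, taking top less base, restores the class frequencies of Table 15.

```python
from itertools import accumulate

# Class frequencies from Table 15 (classes $1.00-$1.49, $1.50-$1.99, ...).
frequencies = [3, 10, 14, 22, 38, 40, 38, 33, 20, 11, 10, 11, 8, 4, 4, 3]

# Top of each rectangle in the upper part of the figure: the running total.
tops = list(accumulate(frequencies))

# Base of each rectangle: the cumulated frequencies of all preceding classes.
bases = [0] + tops[:-1]

# Dropped to the zero line, each rectangle keeps only its own height.
heights = [top - base for base, top in zip(bases, tops)]

assert heights == frequencies
for base, top in zip(bases[:3], tops[:3]):
    print(f"rectangle from {base} to {top}")
```

The ogive is thus the running sum of the histogram, and the histogram the successive differences of the ogive.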


REFERENCES 


Bowley, A. L. Elements of Statistics (52-81).

Chaddock, R. E. Principles and Methods of Statistics (Chaps.
IV, V).

Day, Edmund E. Standardization of the Construction of Statistical
Tables. Quarterly Publications, American Statistical
Association, March, 1920.

Jones, D. C. A First Course in Statistics (5-21).

Kelley, Truman L. Statistical Method (1-37).

King, W. I. Elements of Statistical Method (83-107).

Pearl, Raymond. Medical Biometry and Statistics (74-104).

Rugg, H. O. Statistical Methods Applied to Education (57-94).

Secrist, Horace. Introduction to Statistical Methods (116-157).

Secrist, Horace. Readings and Problems in Statistical Methods
(242-271).

Yule, G. Udny. An Introduction to the Theory of Statistics
(75-105).


CHAPTER IV 


DESCRIPTION OF THE FREQUENCY DISTRIBUTION: 
AVERAGES 


The classification of quantitative data and the construc- 
tion of a frequency distribution constitute an important 
stage in the task of organization and analysis. By means 
of classification the underlying structure of the data may 
be revealed and the essential unity of a mass of material 
may be brought out. But this is only the first step in 
statistical analysis. It remains to develop methods of 
measuring and expressing more concisely the significant 
characteristics of a body of data. For certain purposes the 
frequency distribution itself must be summarized and condensed,
must be boiled down until its essence has been
distilled into three or four significant figures. 

If each frequency distribution constituted a novel and 
unique problem, obeying a law peculiar to itself, the task 
of studying and describing such distributions would be a 
difficult one. Fortunately this is not so. Quantitative
data in widely different fields, when assembled in frequency 
distributions, show certain common characteristics, obey 
certain general laws. Experience in one field, therefore, 
constitutes a guide to work in others. Uniformity in the 
behavior of masses of data makes possible the development 
of a generalized method of organizing, analyzing and com- 
paring measurements drawn from many fields of scientific 
study. 


COMPARISON OF FREQUENCY DISTRIBUTIONS 


This fact of a common law of arrangement running 
through the universe of quantitative facts may be brought 
home most effectively by a comparison of distributions illus- 
trative of various types of data. The characteristics of the 



98 FREQUENCY DISTRIBUTION 


frequency distributions and of the frequency curves which
follow should be noted, and the distributions compared.

The curve in Fig. 36 is based upon the following data
relating to the heights of 18,780 soldiers in a certain army.1


Fig. 36. Frequency Curve: Distribution of 18,780 Soldiers Classified according
to Height
[Horizontal axis: Height in Inches]

TABLE 16

Distribution of Soldiers Classified According to Height

Height in Inches   Number of Soldiers   Height in Inches   Number of Soldiers
      60 +                 197                67 +               3,017
      61 +                 317                68 +               2,287
      62 +                 692                69 +               1,599
      63 +               1,289                70 +                 878
      64 +               1,961                71 +                 520
      65 +               2,613                72 +                 262
      66 +               2,974                73 +                 174

                                              Total             18,780


1 From G. C. Whipple, Vital Statistics, 377. 




Figure 37 depicts a frequency curve based upon the 
errors of observation in 470 astronomical measurements 
made by Bradley.1 The data as presented are in the
following form: 


Fig. 37. Frequency Curve: Distribution of Errors of Observation
(Astronomical)
[Horizontal axis: Magnitude of error (in parts of a second of arc)]


TABLE 17

Distribution of Errors of Observation in Certain Astronomical
Measurements

Magnitude of Errors in Parts     Number of Errors of each
of a Second of Arc, Between             Magnitude
       0.0 and 0.1                         94
       0.1 and 0.2                         88
       0.2 and 0.3                         78
       0.3 and 0.4                         58
       0.4 and 0.5                         51
       0.5 and 0.6                         36
       0.6 and 0.7                         26
       0.7 and 0.8                         14
       0.8 and 0.9                         10
       0.9 and 1.0                          7
       above 1.0                            8

       Total                              470


1 As given by Mellor, Higher Mathematics for Students of Chemistry and
Physics, 514. 




These data, when plotted, will give but one-half of a 
frequency curve, as no attempt has been made to separate 
the errors below the true value from those above the true 
value. The curve has been completed in the figure on the 
assumption, which is quite valid in this case, that the 
number of errors of each magnitude below the true value 
equaled the number of errors of that magnitude above the 
true value. 

If a piece of artillery be accurately adjusted on a given 
target (a point) and 100 shots be fired, it will be found 
that the points of impact of the hundred shots will be 
dispersed about the target. No matter how accurate the 
piece or the adjustment only a small percentage of the 
shots will fall upon the exact point at which they were 
directed. The points of impact will be scattered about the 
target in a quite regular fashion, however. If a rectangle 
be so drawn as to include all the points of impact, and this
rectangle (or zone of dispersion) be divided into eight equal 
parts, the distribution of shots within these sections will be 
as indicated in the following diagram. (In any given case 
there are likely to be slight departures from this order, but 
in the long run this distribution will prevail.) 


|  2  |  7  | 16  | 25  | 25  | 16  |  7  |  2  |

Fig. 38. Zone of Dispersion, Artillery Firing, Showing the Theoretical
Percentage Distribution of Shots


This general rule holds for all classes of guns. The more 
accurate the gun the smaller will be the zone of dispersion, 
but the distribution within this zone is theoretically the 
same in all cases. Rules of fire used in artillery adjustment
are based upon this fact. 

The results of actual firing may be contrasted with this 




theoretical distribution. The accompanying table presents 
a record of one thousand shots fired from a battery gun at 
the middle of a stationary target two hundred yards 
distant. The target was divided by horizontal lines into 
eleven equal divisions.1


TABLE 18

Distribution of One Thousand Shots from a Single Gun

  Division       Number of shots recorded
  1 (top)                    1
  2                          4
  3                         10
  4                         89
  5                        190
  6                        212
  7                        204
  8                        193
  9                         79
 10                         16
 11 (bottom)                 2

 Total                   1,000


These results are presented graphically in Fig. 39. 

The zone of dispersion being divided into eleven divisions 
instead of the eight referred to in describing the theoretical 
distribution, a direct comparison cannot be made. We 
have here, however, the same general type of distribution 
found in the other examples given. A slight tendency 
toward concentration in the lower half of the target is 
undoubtedly due to failure to correct completely for 
gravitation in the laying of the gun. 

When coins are tossed the distribution of heads and 
tails is assumed to be determined by pure chance. In a 
single experiment ten coins were tossed 100 times. The 
following table shows the frequencies with which given 

1 This experiment is recorded in the Report of the Chief of Ordnance, 1878,
Appendix S. The results are given in The Method of Least Squares, Mansfield
Merriman, N. Y., Wiley, 1897, 14.




numbers of heads appeared. (The greatest number of heads 
possible in a given throw under such conditions is, of 
course, 10; it is also possible that no heads should appear.) 


Fig. 39. Column Diagram: Distribution of 1,000 Shots from
a Single Gun
[Vertical axis: Number of Shots. Horizontal axis: Divisions]


TABLE 19

Distribution of Results in Coin Tossing Experiment
(Ten coins tossed 100 times)

Number of Heads     Frequency of Occurrence
      10                       0
       9                       1
       8                       4
       7                       7
       6                      23
       5                      30
       4                      20
       3                       9
       2                       5
       1                       1
       0                       0

Figure 40 depicts the above frequency distribution.

Fig. 40. Frequency Polygon: Distribution of Heads in a Coin-Tossing
Experiment
[Horizontal axis: Number of heads turned up]


DISTRIBUTION OF ECONOMIC DATA


We find in these four widely different fields something 
approaching a uniform law of arrangement of quantitative 
data. The examples which have been given, however, do 
not represent the world of economic facts. Do economic 
data show the same general characteristics? If reference 
be made to the examples given in Chapter III, comparisons 
with the four preceding illustrations may be made. The 
frequency distributions referred to are those relating to 
weekly earnings of employees, the length of life of tele- 
phone poles, the distribution of labor cost in sawmills and 
the distribution of incomes below $4,000 in the United 




States. (The curve of the latter distribution, it should be 
noted, would show a long tail extending far to the right if 
the incomes above $4,000 were included.) Several additional 
examples of economic data may be given. 

Figure 41 illustrates the order in which price variations 
are distributed. It is based upon a study made by 


Fig. 41. Frequency Polygon: Distribution of 5,540 Cases of Change in the
Wholesale Prices of Commodities from One Year to the Next (After Mitchell)
[Vertical axis: Frequency. Horizontal axis: Percentage of Fall, Percentage of Rise]


W. C. Mitchell of 5,578 individual cases of change in 
the wholesale prices of commodities from one year to the
next.1 Thus, for example, the average price of middling
upland cotton in New York in 1912 was $0.115 per pound. 
In 1913 the average price was $0.128 per pound, an increase 


1 From Bulletin 284, U. S. Bureau of Labor Statistics, Part I, “The Making
and Using of Index Numbers,” 18. The figure shows the price changes only 
within the range of a 51% fall and a 51% rise. One case of a price fall of 55% 
is not shown, and 37 cases of price increases ranging from 52% to 104% have not 
been included. 




of 11.83%. This would constitute one entry in the table of 
rising prices, falling in the class “10-11.9%.” The entire 
table consists of 5,578 such entries. These data are pre- 
sented in Fig. 41 in the form of a frequency polygon, no 
attempt being made to smooth the curve. 

Table 20 shows the distribution of London-New York 
exchange rates (sterling exchange) during the period 1882- 
1913, inclusive. The time factor has been ignored and 


Fig. 42. Frequency Polygon: Distribution of London-New York Exchange
Rates (as recorded over a period of 384 months)
[Horizontal axis: Dollars]


monthly rates have been classified according to the frequency
of their occurrence over this thirty-two year period.1
The data are presented graphically in Fig. 42. 


1 “The figures are ...... the averages of those quoted at the beginning of each
month in the Economist; on and after July, 1886, the exchange is the ‘telegraphic
transfer’, before that date, ‘short at interest.’” The data are taken from An
Academic Study of Some Money Market and Other Statistics, by E. G. Peake.
London, P. S. King, 1923. Appendix I.




TABLE 20

Distribution of London-New York Exchange Rates as Recorded
by Months During the Period 1882-1913

                                  Frequency
   Class-interval          (Number of months given
                               rate prevailed)
  $4.8275-$4.8324                     1
   4.8325- 4.8374                     6
   4.8375- 4.8424                    11
   4.8425- 4.8474                    21
   4.8475- 4.8524                    23
   4.8525- 4.8574                    24
   4.8575- 4.8624                    20
   4.8625- 4.8674                    40
   4.8675- 4.8724                    45
   4.8725- 4.8774                    49
   4.8775- 4.8824                    35
   4.8825- 4.8874                    45
   4.8875- 4.8924                    33
   4.8925- 4.8974                    16
   4.8975- 4.9024                     8
   4.9025- 4.9074                     1
   4.9075- 4.9124                     6

   Total                            384


The frequency curves and histograms based upon eco- 
nomic data, it will be noted, do not all show the symmetry 
and regularity which seem to characterize the curves 
representing physical data. Some are skewed to the right 
or left, there are breaks in the regularity of the increase or 
decrease of frequencies in certain of the examples. But in 
spite of these differences there is obviously a family re- 
semblance between the measurements drawn from the 
fields of economics, astronomy, anthropometry, ballistics,
and pure chance. Certain of the common characteristics 
may be noted. 


GENERAL CHARACTERISTICS OF FREQUENCY DISTRIBUTIONS 


There is, in the first place, variation in the values of the 
measurements secured. Human heights vary, astronomical 
measurements of the same quantity differ, projectiles fired 




under conditions as nearly constant as it is humanly 
possible to make them fail to land at the same spot, 
incomes vary as between individuals, and exchange rates 
move from week to week and month to month. The 
various observations or values secured in a given case are 
distributed along a scale, between two extreme values. 

The distribution of these values along the scale (the 
x-axis) is such that, moving from one extreme value 
towards the other, the cases found at successive points 
along the scale (the successive class frequencies) increase 
with more or less regularity up to a maximum, and then 
decrease in much the same way. In spite of variation, 
therefore, we find a central tendency, a massing of cases at 
certain points on the scale of values. This is the second 
notable characteristic which all the frequency distributions 
appear to possess in common. 

If we measure, for each of the successive classes, the 
amount of deviation along the scale from the point of 
greatest concentration it will be noted that small deviations
are much more frequent than large ones, that extreme 
deviations are rare, and that deviations on both sides of 
the point of concentration reach perfect (or almost perfect) 
equality in the examples taken from the physical sciences 
and from the field of pure chance, and approximate equality 
in the economic distributions. (Exceptions to this rule 
of approximate equality on the two sides of the point of 
greatest concentration are not infrequent, the example of 
income distribution being a rather striking case in point.) 

Figure 43 depicts a curve which is termed the “probability
curve,” or the “normal curve of error.” Its characteristics
will be discussed in greater detail in a later section.
At this point it is presented merely as a basic type which 
some of the above examples approach closely, and from 
which others of the examples represent more or less pro- 
nounced deviations. Departures from this type, let it be 
emphasized, are numerous and significant, but as a basic 




form this normal curve of error is extremely important in 
statistical work. Even the most important variations from 
this type resemble it with sufficient closeness to justify the 
use of a generalized method of describing frequency distri- 
butions. Distributions of quantitative data vary, and their 


Fig. 43. The Normal Curve of Error


variations from each other and from certain standard types 
are of the greatest significance, but in spite of their varia- 
tions a family resemblance runs through them all. Each 
new frequency distribution is not an isolated phenomenon,
but a member of a large family, and as such the problem 
of describing and analyzing it may be approached with 
confidence in methods which have been found applicable in 
other cases. 

Given this more or less common type, how may a given 
distribution be described and differentiated from others? 
Certain methods will have been suggested by the preceding 
discussion. 




METHODS OF DESCRIBING A FREQUENCY DISTRIBUTION


The values of all the observations, it has been noted, are 
spread along a scale. The frequency distribution may be 
described by the selection of a single value on that scale 
which is thoroughly representative of the distribution 
as a whole. Since the frequencies vary, an obvious choice 
is the selection of that value which occurs the greatest 
number of times, or, in other words, that point on the 
scale at which the concentration is greatest. This value 
constitutes a measure of the central tendency of the distri- 
bution. Thus, one might find the income class in which 
the greatest number of people fall, and let the mid-point of 
that class (which is $950 in the distribution presented in 
Table 11) serve as the representative of the distribu- 
tion. This most common value, it should be noted, is only 
one of several possible measures of the central tendency 
of a given distribution. All such measures are termed 
averages. 

A single representative value such as this has many uses 
but, by itself, it obviously leaves out many facts concern- 
ing the distribution. Of great importance is the character 
of the distribution about the average. Are the values of all 
tabulated cases closely concentrated, or is there pronounced 
dispersion over a wide range? The representative character 
of any average depends upon how closely the other values 
cling to it, upon the degree of concentration about the 
central tendency. The average, therefore, must be supple- 
mented by a measure of variation, a measure of the “scatter” 
about the central value. 

An adequate description should include also an account 
of the degree of symmetry of the distribution. It is highly 
important to know whether there is an equal distribution 
of cases on each side of the point of greatest concentration, 
or whether the frequency curve is skewed to one side, 
as in the case of income distribution illustrated above. 




If the curve is not symmetrical the degree of asymmetry 
should be determined, and for this purpose measures of 
skewness have been developed. 

It is, finally, possible to measure the degree of peaked- 
ness of frequency curves, by comparing them with the 
normal curve of error as a standard. It is obvious that the 
frequency polygon representing price changes (Fig. 41) 
would, if smoothed, constitute a curve much more peaked 
than the normal curve, and this fact of pronounced con- 
centration at the central value is highly significant. This 
characteristic of frequency curves is called kurtosis, and 
the measurement of kurtosis constitutes the final step in 
the description of the frequency distribution. 

When these various measures have been secured the task 
of statistical analysis will be well under way. The chaotic 
assortment of data with which we started will have been 
reduced to workable form in the shape of a frequency table, 
and the essential facts which the table reveals will have 
been distilled into three or four significant measures. This 
process not only reveals the characteristics of the given 
distribution, but also facilitates comparison with similar 
distributions. For example, it is impossible to compare 
some tens of millions of unorganized personal income figures 
for the United States with similar data for Great Britain. 
But if we secure a value for the average or most repre- 
sentative income for each country, together with a descrip- 
tion of the distribution of personal incomes about that 
central value, a legitimate basis for comparative study is
obtained. In manipulating and analyzing masses of ma- 
terial, whatever the purpose of study may be, full use 
should be made of the power to condense, simplify and 
compare which is given by the measures employed in 
describing the frequency distribution. 

The succeeding section is devoted to a discussion of one 
phase of this descriptive process, that concerned with the 
measurement of central tendencies. After the development 




of this subject of averages, problems relating to measures 
of variation and of skewness will be dealt with. 


AVERAGES 


We have seen that the representation of a frequency 
distribution by an average, a single typical figure, is justi- 
fied because of the tendency of large masses of figures to 
cluster about a central value, from which the values of all 
observed cases depart with more or less regularity and 
smoothness. It is solely because of the concentration of 
cases about a central point on the scale that such repre- 
sentative figures have significance. The average represents 
the distribution as a whole only because it is a typical 
value. If the individual items entering into a distribution 
vary widely in value and show no tendency toward con- 
centration, no single value can represent them. Thus the 
arithmetic mean of the three numbers 3; 125; 1000 is 376, 
but 376 in no way represents the three values on which it 
is based. This fundamental requirement, that there be a 
tendency toward concentration about a central value, must 
be met if an average is to be at all representative. 
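
The point may be checked with the three numbers just cited. The sketch below (Python; an illustration only, with hypothetical names) computes the arithmetic mean and then the departure of each value from it.

```python
from statistics import mean

values = [3, 125, 1000]

# The arithmetic mean: the sum of the values divided by their number.
average = mean(values)

# How far each value lies from the mean.
deviations = [value - average for value in values]

print(average)
print(deviations)
```

Every deviation is of the same order of size as the mean itself, which is what is meant by saying that 376 in no way represents the three values.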

If the general character of a frequency distribution be 
recalled the logic of one sort of average will be clear at 
once. It was suggested above that that point on the 
x-scale at which the concentration is greatest, that value 
which occurs the greatest number of times, might be taken 
as typical of the entire distribution. This value is termed 
the mode, and the group in which it falls is called the modal 
group. If a frequency curve be drawn to represent a given 
distribution, the mode will be the x-value corresponding to 
the maximum ordinate.1 The maximum ordinate itself
measures the frequency of the modal group. Students fre- 
quently confuse these two values in determining the mode. 


1 Strictly speaking, the mode is the x-value corresponding to the maximum
ordinate of the ideal frequency curve which has been fitted to the given distribution.


It is not the distance along the y-scale but the distance 
along the x-scale which measures the value of the mode. 
The ordinates merely measure the number of cases falling 
in the several classes, not the values of the cases falling in 
those classes. 

As typical of a given distribution we might also select 
that point on the scale of x-values on each side of which 
one half the total number of cases fall. This value, which 
is called the median, is that which exceeds the values of 
one half the cases included, and is in turn exceeded by the 
values of one half the cases. Thus it has been estimated 
that in 1918 the median value of personal incomes in the 
United States was $1,140; one half of the 37 million recipi- 
ents of personal incomes received less than this sum, while 
one half received more. When a distribution is represented 
by a frequency curve, the area under the curve is divided 
into two equal parts by an ordinate erected at that point 
on the x-axis corresponding to the median value. This
follows, of course, from the definition of the median, and 
from the fact that the area under a frequency curve repre- 
sents the total number of cases included in the distribution. 

The arithmetic mean is a third type of average which 
may be used to represent a distribution. This is a calcu- 
lated average, affected by the value of every item in the 
distribution. Herein, obviously, it differs from the mode 
and the median, which depend primarily upon the relative 
position of the items in the frequency table, and are not 
affected by the values of all individual items. The arith-
metic mean is the center of gravity of a distribution; it
would be the x-value of the point of balance of a frequency
curve, if the curve could be blocked out and manipulated 
in solid form. 
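The contrast between a calculated average and the averages of position may be checked on a small series. The following is a minimal Python sketch; the wage figures are hypothetical and are not taken from the text. When only the largest item is raised, the arithmetic mean moves, while the median and the mode remain where they were.

```python
from statistics import mean, median, mode

# Hypothetical weekly wages in dollars (illustrative only, not from the text).
wages = [18, 20, 20, 22, 25]
# The same series with only the largest wage raised.
wages_altered = [18, 20, 20, 22, 65]

for label, series in (("original", wages), ("altered", wages_altered)):
    # The mean uses the value of every item; the median and mode
    # depend on position and frequency only.
    print(label, "mean:", mean(series), "median:", median(series), "mode:", mode(series))
```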

The geometric mean and the harmonic mean are two other 
averages the characteristics of which will be discussed at a 
later point. 

The computation or location of these various averages
may involve somewhat lengthy processes if the number of
cases included be great. If appropriate methods be em- 
ployed, however, the labor of computation may be materi- 
ally cut down. The use of the following symbols will 
simplify the explanation of these methods: 


M: Arithmetic mean.
Mo: Mode.
Md: Median.
m: The value of an individual observation; in a
   frequency distribution, the value of the mid-point
   of a class.
f: The number of cases in a given class in a frequency
   distribution.
N: The total number of cases in a given series or
   frequency distribution.
Σ (Sigma): The symbol for the process of summation,
   meaning “the sum of.”


THE COMPUTATION OF THE ARITHMETIC MEAN


Using the above notation, the formula for the arithmetic
mean is

$$M = \frac{\Sigma m}{N}$$

Thus the mean of the measures 2, 5, 6, 7, is equal to the
sum of these measures divided by 4, which is $\frac{20}{4} = 5$. The


computation of the arithmetic mean when each measure is 
reported at its true value is thus a simple process of sum- 
mation and division. The weekly earnings of 210 factory 
employees were listed in an earlier section. If these figures 
be added, and the total divided by 210, the mean weekly 
wage is found to be $26.982. In this case the task of add- 
ing 210 items is somewhat tedious; it is a task which 
would become almost impossible if one were dealing with 
the 37 million personal income figures, for example. For
practical reasons, therefore, it is usually necessary to com-
pute the required averages from the frequency distribution
rather than from the original ungrouped data. To exemplify 
this process certain data relating to the yield of wheat in 
North Dakota may be utilized. The average yield of 
wheat per acre in each of the 53 counties of North Dakota 
for the period 1911-1921 has been recorded by the U.S. 
Bureau of Agricultural Economics. The figures are sum- 
marized in the following table: 


TABLE 21

Calculation of the Arithmetic Mean of Wheat Yield in the Counties
of North Dakota, 1911-1921¹

Class-interval       Mid-point    Frequency
(in bu. per acre)        m            f            fm
    0- 1.9               1            3             3
    2- 3.9               3           26            78
    4- 5.9               5           78           390
    6- 7.9               7          107           749
    8- 9.9               9          113         1,017
   10-11.9              11           65           715
   12-13.9              13           40           520
   14-15.9              15           22           330
   16-17.9              17           45           765
   18-19.9              19           41           779
   20-21.9              21           21           441
   22-23.9              23            8           184
                                    569         5,971

$M = \frac{\Sigma(fm)}{N} = \frac{5{,}971}{569} = 10.49$ bu.


1 Average yields, by counties, for this period are given in Bulletin 165, Agri-
cultural Experiment Station, North Dakota Agricultural College, 1922, Cost of
Production and Farm Organization on 126 Farms in North Dakota, 1921, 120-21.

The importance of certain of the precautions mentioned
in the section on classification will be clear from this ex-
ample. It was there stated that the class-interval should
be so selected that the error in assuming an even distribu-
tion of cases throughout each class-interval would not be
great. If the cases in each class are evenly distributed we
may take the mid-point of that class as representative of
all the cases included, but if there is not an even distribu-
tion the mid-point of the class ceases to be representative.
The process of calculating the value of the mean from the
frequency distribution rests upon the assumption of an
even distribution of cases within each class. The mid-
point of each class is multiplied by the number of cases in
that class, it being assumed that the product is approxi-
mately equal to the sum of all the individual items in the
class. The formula for the mean thus becomes $M = \frac{\Sigma(fm)}{N}$.
Table 21 illustrates the procedure in detail.
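The computation of Table 21 may be sketched in a few lines of Python. This is an illustration only; the function name `grouped_mean` is an assumption introduced here, and the figures are the mid-points and frequencies of the wheat yield table (with 22 cases in the 14-15.9 class, the frequency consistent with the product 330 and the total of 569).

```python
# Mid-points (bu. per acre) and frequencies of the wheat yield distribution.
midpoints = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23]
frequencies = [3, 26, 78, 107, 113, 65, 40, 22, 45, 41, 21, 8]

def grouped_mean(midpoints, frequencies):
    """Return sum(f * m) / N, treating each mid-point as typical of its class."""
    total_cases = sum(frequencies)                                # N
    sum_fm = sum(f * m for m, f in zip(midpoints, frequencies))   # sum of fm
    return sum_fm / total_cases

mean_yield = grouped_mean(midpoints, frequencies)
print(sum(frequencies), round(mean_yield, 2))  # 569 10.49
```

The result depends on the assumption stated above, that the cases within each class are evenly distributed about its mid-point.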

The value secured in this way is sometimes called a 
weighted arithmetic mean. What we do, in effect, is to 
secure the arithmetic mean of the 12 figures in the column 
headed m. We do not take a simple average of these 
figures, however, but weight each one in proportion to the 
number of cases falling in the class-interval of which it is 
the mid-point. It is precisely the same procedure as that 
which we should follow in calculating the mean of five 
men’s incomes, two of whom, let us say, have incomes of 
$2,000 and three of whom have incomes of $3,000. Clearly 
it would not do to add the figures $2,000 and $3,000, dividing 
the sum by two. The figure $2,000 is given a weight of 2, 
the figure $3,000 is given a weight of 3, and the resultant 
sum, $13,000, is divided by 5. Though the procedure in 
working from the frequency distribution is thus a form of 
weighting, the term “weighted average” is coming to have
a more restricted meaning, to be explained at a later point, 
and should not in general be applied to an average com- 
puted from a frequency distribution. 


SHORT METHOD OF COMPUTING THE ARITHMETIC MEAN


The calculation of the arithmetic mean from the fre-
quency table is much easier, in general, than from the un-
grouped data, but when the number of cases included is
large even the computation from the frequency table by 
the method illustrated above may be laborious. The pro- 
cedure may be greatly simplified. 

From the method of computing the arithmetic mean it 
follows that the algebraic sum of the deviations of a series 
of individual magnitudes from their mean is zero. This 
may be readily demonstrated. We may represent the series
of magnitudes by $m_1, m_2, m_3, \ldots m_N$, their arithmetic
mean by $M$, and the deviations of the various magnitudes
from the mean by $d_1, d_2, d_3, \ldots d_N$.
Then

$$M = \frac{m_1 + m_2 + m_3 + \ldots + m_N}{N} \qquad (1)$$

and

$$NM = m_1 + m_2 + m_3 + \ldots + m_N \qquad (2)$$

The number of terms, of course, is equal to N. Therefore

$$(m_1 - M) + (m_2 - M) + (m_3 - M) + \ldots + (m_N - M) = 0 \qquad (3)$$

But $m_1 - M = d_1$, $m_2 - M = d_2$, etc., and equation (3) may be written

$$\Sigma d = 0.$$
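The property just demonstrated may be verified numerically. The sketch below is an illustration, not part of the original demonstration; the function name `sum_of_deviations` is an assumption, and exact fractions are used so that rounding cannot disturb the result.

```python
from fractions import Fraction

def sum_of_deviations(measures):
    """Return the algebraic sum of the deviations of the measures from their mean."""
    mean_value = Fraction(sum(measures), len(measures))  # exact arithmetic, no rounding
    return sum(m - mean_value for m in measures)

print(sum_of_deviations([2, 5, 6, 7]))    # the four measures used earlier in the text
print(sum_of_deviations([3, 125, 1000]))  # the widely scattered series cited earlier
```

Both sums are zero, however closely or widely the items are scattered.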


Knowing this to be true we may measure the deviations 
of a series of magnitudes from any arbitrary quantity, 
secure the algebraic sum of the deviations, and from this 
value ascertain the difference between the arbitrary quan- 
tity and the true mean. For this difference will be the 
mean of the deviations from the arbitrary origin. If we
let $M'$ represent the arbitrary origin, or assumed mean,
while $c = M - M'$, and $d_1', d_2', d_3', \ldots d_N'$ represent the
deviations of the various magnitudes from $M'$ (i.e. $d_1' = m_1 - M'$,
$d_2' = m_2 - M'$, etc.), then

$$d_1' = d_1 + c, \quad d_2' = d_2 + c, \quad d_3' = d_3 + c, \quad \ldots \quad d_N' = d_N + c$$

and

$$\Sigma d' = \Sigma d + Nc$$

But

$$\Sigma d = 0$$

$$\therefore \; \Sigma d' = Nc$$

and

$$c = \frac{\Sigma d'}{N}$$

From the known values of $M'$ and $c$ the value of the true
mean may be obtained, for $M = M' + c$. The procedure is
illustrated in the following simple example:


TABLE 22

Computation of the Arithmetic Mean (Short Method)
(Ungrouped data)

   m       f       d'
   5       1      −15        $M' = 20$
  15       1      − 5
  25       1      + 5        $c = \frac{\Sigma d'}{N} = \frac{25}{5} = 5$
  35       1      +15
  45       1      +25        $M = M' + c = 20 + 5 = 25$
           5      +25


When the deviations are measured from 20 as arbitrary 
origin there is in each case a constant error, if the devia- 
tion from the true mean be taken as standard. This error 
is equal to the difference between the true and the assumed 
means. The algebraic sum of the deviations from the 
assumed mean will equal N times this constant error, since 
the error is repeated once for every item included. By 
dividing the sum of these deviations by N the amount of 
the error may be determined and the value of the mean 
thus obtained. 
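The reasoning may be illustrated with the five items of Table 22. The following Python sketch is a hedged illustration; the function name `mean_by_assumed_origin` and the second origin of 40 are assumptions introduced here to show that the choice of origin does not affect the result.

```python
def mean_by_assumed_origin(measures, assumed_mean):
    """Return the true mean, computed from deviations about an arbitrary origin."""
    deviations = [m - assumed_mean for m in measures]  # d' = m - M'
    correction = sum(deviations) / len(measures)       # c = (sum of d') / N
    return assumed_mean + correction                   # M = M' + c

measures = [5, 15, 25, 35, 45]               # the five items of Table 22
print(mean_by_assumed_origin(measures, 20))  # 25.0
print(mean_by_assumed_origin(measures, 40))  # 25.0, from a different arbitrary origin
```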

The work of computation may be still further abbrevi- 
ated in the case of quantities arranged in the form of a 
frequency distribution by measuring the deviations in 
terms of the class-interval as a unit. Then, in finally ap- 
plying the necessary correction, the difference between the 
true and assumed means may be again expressed in terms 
of the original units. The method may be illustrated in
detail by means of the wheat yield data for which the
mean has already been calculated. 


TABLE 23

Calculation of the Arithmetic Mean of Wheat Yield in the Counties
of North Dakota, 1911-1921

(Short Method)

                                                 d'
Class-interval       Mid-point   Frequency   (in class-            fd'
(in bu. per acre)        m           f        interval         −         +
                                              units)
    0- 1.9               1           3          −4             12
    2- 3.9               3          26          −3             78
    4- 5.9               5          78          −2            156
    6- 7.9               7         107          −1            107
    8- 9.9               9         113           0              0
   10-11.9              11          65          +1                       65
   12-13.9              13          40          +2                       80
   14-15.9              15          22          +3                       66
   16-17.9              17          45          +4                      180
   18-19.9              19          41          +5                      205
   20-21.9              21          21          +6                      126
   22-23.9              23           8          +7                       56
                                   569                       −353      +778

Calculations

1. Algebraic sum of deviations from M':
       +778 − 353 = +425
2. Calculation of c (in class-interval units):
       $c = \frac{425}{569} = .7469$
3. Reduction of c to original units:
       Class-interval = 2
       c (in original units) $= .7469 \times 2 = 1.4938$
4. Determination of M:
       $M = M' + c$
       $M = 9 + 1.4938$
       $M = 10.4938$ bu.


The steps in this process of calculating the arithmetic 
mean by the short method may be briefly summarized: 


1. Organize the data in the form of a frequency distribution.

2. Adopt as the assumed mean the mid-point of a class near
   the center of the distribution.

3. Arrange a column showing the deviation (d’) from the
   assumed mean of the items in each class, in terms of
   class-interval units. This deviation will be zero for the
   items in the class containing the assumed mean, −1
   for the items in the next lower class, +1 for the items
   in the next higher class, and so on.

4. Multiply the deviation of each class by the frequency of
   that class, taking account of signs. These products are
   entered in the column fd’.

5. Get the algebraic sum of the items entered in the column
   fd’.

6. Divide this sum by the total frequency (N). The quotient
   is the correction (c) in class-interval units.

7. Multiply the correction (c) by an amount equal to the
   class-interval. The product is the correction in terms
   of the original units.

8. Add this correction (algebraically) to the assumed mean
   (M’); the sum is the true mean (M).
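The eight steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `short_method_mean` and its parameter names are introduced here, and the data are the mid-points and frequencies of the wheat yield table, with 9 as the assumed mean and 2 as the class-interval.

```python
def short_method_mean(midpoints, frequencies, assumed_mean, class_interval):
    """Return (sum of fd', mean) computed by the short method."""
    # Step 3: deviation of each class from the assumed mean, in class-interval units.
    steps = [(m - assumed_mean) / class_interval for m in midpoints]
    # Steps 4 and 5: algebraic sum of the products f * d'.
    sum_fd = sum(f * d for f, d in zip(frequencies, steps))
    # Step 6: correction in class-interval units.
    correction_units = sum_fd / sum(frequencies)
    # Steps 7 and 8: convert to original units and add to the assumed mean.
    return sum_fd, assumed_mean + correction_units * class_interval

midpoints = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23]
frequencies = [3, 26, 78, 107, 113, 65, 40, 22, 45, 41, 21, 8]

sum_fd, mean_yield = short_method_mean(midpoints, frequencies, assumed_mean=9, class_interval=2)
print(sum_fd, round(mean_yield, 4))  # 425.0 10.4938
```

The value agrees with that obtained by the longer method, 5,971 divided by 569.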


LOCATION OF THE MEDIAN: UNGROUPED DATA


The median is a value of the x-variable so selected that 
50 per cent of the total number of cases, when arranged in 
order of magnitude, lie below it and 50 per cent above it. 
For many frequency distributions this is a useful and 
significant value. 

When handling data which are not arranged in the form 
of a frequency distribution the location of the median is a 
simple matter. The data having been arranged in order of 
magnitude, it is necessary only to count from one end 
until that point on the scale of values is found which 
divides the number of cases into two equal parts. As a 
simple example we may assume the following seven figures 
to represent the annual incomes of seven individuals: 


$750 $975 $1128 $1450 $1475 $1825 $1950 


The scale of values extends from $750 to $1950, and 
seven items are arranged along this scale. The value of 
$1000 has two items on one side and five items on the 
other, so obviously does not conform to our definition of
the median. The value of $1450, which corresponds with
the income of one of the seven individuals, is the median
in this case. Three items lie on each side of this value; or,
if we assume the central item to be cut in two, 3½ items lie
on each side of this point. This case is illustrated in Fig. 44.
This diagram may help to bring out the fact that the 
median is a point on a scale, so located that it cuts the 
frequencies in two. 

The problem is slightly different when an even number 
of cases is included. This condition is exemplified in the 
following table which shows the average earnings of band 
sawyers, in sawmills in each of 22 States, during the 
year 1921. 


TABLE 24
Average Earnings of Band Sawyers by States, 1921¹

                         Average earnings
State                    per one-man hour
Virginia                      $0.586
Pennsylvania                    .621
Tennessee                       .627
North Carolina                  .647
West Virginia                   .658
Maine                           .686
South Carolina                  .721
Wisconsin                       .729
Michigan                        .730
Alabama                         .749
Minnesota                       .755
Texas                           .761
Georgia                         .779
Arkansas                        .785
Mississippi                     .798
Louisiana                       .824
Florida                         .825
Idaho                           .833
California                      .864
Montana                         .904
Washington                     1.045
Oregon                         1.106

1 From Monthly Labor Review, January, 1923, 9. 



The median must be in this case a value on each side of 
which 11 States lie. Therefore any value exceeding $0.755 
(average earnings in Minnesota) and less than $0.761 
(average earnings in Texas) will satisfy the definition of 
the median. Under these conditions, where the median is 
really indeterminate, a value half way between the two
limiting values is accepted, by convention. The median of
the 22 figures given would thus be $0.758.

FIG. 44. Illustrating the Location of the Median with Ungrouped Data
(Personal incomes of seven individuals)
[Figure: the seven incomes, marked (a) to (g), plotted along an income scale in dollars running from 1000 to 2000, with the median marked at 1450.]

In this example the median value does not correspond 
with the earnings in any one State. This will frequently 
be so when there is an even number of cases. When the 
data being handled constitute a discrete series an unreal 
value, inconsistent with the nature of the data, may be 
secured for the median. Thus the median value of the 
discount rates on 60-90 day commercial paper in the New 
York market during the period 1900-1922 is found to be 
4.865 per cent, a thoroughly artificial rate which actually 
never prevailed in the market. 
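Both cases, the odd and the even number of items, may be sketched in Python. The function name `median_ungrouped` is an assumption introduced for illustration; the data are the seven incomes and the 22 State averages of Table 24 as read from the text.

```python
def median_ungrouped(values):
    """Return the median of ungrouped data, by counting from the lower end."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of cases: the central item itself is the median.
        return ordered[middle]
    # Even number of cases: by convention, half way between the two central items.
    return (ordered[middle - 1] + ordered[middle]) / 2

incomes = [750, 975, 1128, 1450, 1475, 1825, 1950]
earnings = [0.586, 0.621, 0.627, 0.647, 0.658, 0.686, 0.721, 0.729,
            0.730, 0.749, 0.755, 0.761, 0.779, 0.785, 0.798, 0.824,
            0.825, 0.833, 0.864, 0.904, 1.045, 1.106]

print(median_ungrouped(incomes))             # 1450
print(round(median_ungrouped(earnings), 3))  # 0.758
```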


LOCATION OF THE MEDIAN: GROUPED DATA


The task of locating the median is essentially the same 
when the data are in the form of a frequency distribution. 
The fact that the real values of the individual items are 
not known, because of the grouping by classes, complicates
the problem slightly. The following data, representing the
average discount rate on 60-90 day commercial paper in 
the New York market for 276 consecutive months, may be 
used in illustrating the method. 


TABLE 25

Location of Median, Discount Rates, New York
60-90 da. choice double name commercial paper, 1900-1922

Class-interval           Mid-point    Frequency
(Discount rate in %)         m            f
    2.75-3.24              3.00           8
    3.25-3.74              3.50          31
    3.75-4.24              4.00          52
    4.25-4.74              4.50          38
    4.75-5.24              5.00          39
    5.25-5.74              5.50          42
    5.75-6.24              6.00          35
    6.25-6.74              6.50          13
    6.75-7.24              7.00           5
    7.25-7.74              7.50           4
    7.75-8.24              8.00           9
                                        276

$N/2 = 276/2 = 138$
$Md = 4.75 + (9/39 \times .50)$
$\quad\; = 4.75 + .115$
$\quad\; = 4.865$


In the present case the location of the median involves 
the determination of that value on each side of which 138 
items lie. We may assume that we start at the lower end 
of the scale and move through the successive classes. 
When we reach the upper limit of the first class (that in- 
cluding items having values from 2.75 to 3.25) we have 
left behind us 8 cases, while 268 lie in front of us. When 
the upper limit of the second class is attained, 39 items 
have been passed; when the upper limit of the fourth class
is reached, 129 items have been passed. The upper limit 
of the fifth class has below it 168 items. Somewhere be- 
tween the lower and upper limits of the fifth class lies the 
desired point, that which has 138 items on each side of it. 
How far must we move through this class, from 4.75 to 
5.25 in order to reach this point? 

It will be recalled that, for purposes of calculation, the 
assumption is made that there is a uniform distribution of
the items lying within any given class. Since before we
reach the fifth class 129 cases have been counted, only 9 
of the 39 included in this class are needed to complete the 
desired number, 138. On the assumption of even distri- 
bution the required 9 cases will lie within a distance on 
the scale equal to 9/39 of the class-interval. The class- 
interval is .50; 9/39 of .50 is equal to .115. As we move 
up the scale, then, having reached 4.75, we proceed an 
additional distance equal to .115. At a point on the scale 
having a value of 4.865 is the dividing line on each side of 
which lie 138 cases. This is the value of the median. 

The process of computation is shown at the right of the 
frequency table. The following is a summary of the steps 
involved in the location of the median: 

1. Arrange the data in the form of a frequency distribution. 

2. Divide the total number of measures by 2; this gives the 
number which must lie on each side of the point to be 
located. 

3. Begin at the lower end of the scale and add together the 
frequencies in the successive classes until the lower limit 
of the class containing the median value is reached. 

4. Determine the number of measures from this class which 
must be added to the frequencies already totaled to give 
a number equal to N/2. 

5. Divide the additional number thus required by the total 
number of cases in the class containing the median. This 
indicates the fractional part of the class-interval within 
which the required cases lie. 

6. Multiply the class-interval by the fraction thus set up. 

7. To the lower limit of the interval containing the median 
add the result of the multiplication process indicated in 
(6). This gives the value of the median. 

The last three steps constitute merely a simple form of 

interpolation. 
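The steps summarized above can be sketched in Python. The function names are assumptions introduced for illustration, and the data are the lower class limits and frequencies of the discount rate distribution. A second function counts downward from the upper end of the scale and subtracts from the upper limit of the median class, so that the two results may be compared.

```python
def grouped_median(lower_limits, frequencies, class_interval):
    """Locate the median by counting up from the lower end of the scale."""
    half = sum(frequencies) / 2  # number of cases on each side of the median
    passed = 0                   # cases lying below the current class
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and passed + f >= half:
            # Interpolate within the class, assuming an even distribution of cases.
            return lower + (half - passed) / f * class_interval
        passed += f
    raise ValueError("no cases in the distribution")

def grouped_median_from_top(lower_limits, frequencies, class_interval):
    """Locate the median by counting down from the upper end of the scale."""
    half = sum(frequencies) / 2
    passed = 0                   # cases lying above the current class
    for lower, f in zip(reversed(lower_limits), reversed(frequencies)):
        if f > 0 and passed + f >= half:
            upper = lower + class_interval
            return upper - (half - passed) / f * class_interval
        passed += f
    raise ValueError("no cases in the distribution")

lower_limits = [2.75, 3.25, 3.75, 4.25, 4.75, 5.25, 5.75, 6.25, 6.75, 7.25, 7.75]
frequencies = [8, 31, 52, 38, 39, 42, 35, 13, 5, 4, 9]

md_up = grouped_median(lower_limits, frequencies, 0.50)
md_down = grouped_median_from_top(lower_limits, frequencies, 0.50)
print(round(md_up, 3), round(md_down, 3))  # 4.865 4.865
```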

The entire process may be reversed by beginning at the 
upper end of the scale and counting downwards. In this 
case the final operation is one of subtraction from the upper 
limit of the interval containing the median. 


If N is an odd number N/2 will contain a fractional 
value. The operation is precisely the same as that out- 
lined above. 

QUARTILES AND DECILES 


For many purposes it is desirable to locate on the scale 
of values, along which the items constituting a frequency 
distribution are ranged, points dividing the total number 
of measures in other ways. Similar to the median, which 
divides the total number of cases into two equal groups, 
are the quartiles, deciles and percentiles. The quartiles, 
as the term implies, are points on the scale which divide 
the entire number of measures into four equal groups, the 
deciles divide the number into ten equal groups, and the 
percentiles divide the total number of cases into 100 equal 
groups. Thus the first quartile is that point on the scale 
below which one quarter of the total number of cases lie 
and above which three quarters of the total number of 
cases lie. The second quartile and the median are identical 
values. The third decile is that point on the scale below 
which three tenths of the total number of cases lie and 
above which seven tenths of the total number of cases lie. 
In all cases the count begins at the lower end of the scale. 

Example. Location of the First Quartile ($Q_1$), Discount Rates
(See Table 25.)

$N/4 = 69$
$Q_1 = 3.75 + (30/52 \times .50) = 4.038$


Example. Location of Fourth Decile ($D_4$), Discount Rates
(See Table 25.)

$N/10 = 27.6$
$4N/10 = 110.4$
$D_4 = 4.25 + (19.4/38 \times .50) = 4.505$

A method of locating median, quartiles, deciles and
percentiles graphically is explained below.
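The same interpolation serves for any quartile, decile or percentile, the only change being the number of cases that must lie below the point sought. The following Python sketch is a generalization offered for illustration; the function name `grouped_fractile` and its parameters `k` and `parts` are assumptions, not notation from the text.

```python
def grouped_fractile(lower_limits, frequencies, class_interval, k, parts):
    """Return the point below which k/parts of the cases lie (count from the lower end)."""
    target = k * sum(frequencies) / parts  # cases required below the point
    passed = 0                             # cases lying below the current class
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and passed + f >= target:
            return lower + (target - passed) / f * class_interval
        passed += f
    raise ValueError("no cases in the distribution")

# Discount rate distribution of Table 25.
lower_limits = [2.75, 3.25, 3.75, 4.25, 4.75, 5.25, 5.75, 6.25, 6.75, 7.25, 7.75]
frequencies = [8, 31, 52, 38, 39, 42, 35, 13, 5, 4, 9]

first_quartile = grouped_fractile(lower_limits, frequencies, 0.50, k=1, parts=4)
fourth_decile = grouped_fractile(lower_limits, frequencies, 0.50, k=4, parts=10)
second_quartile = grouped_fractile(lower_limits, frequencies, 0.50, k=2, parts=4)
print(round(first_quartile, 3), round(fourth_decile, 3), round(second_quartile, 3))
```

The second quartile, computed with `k=2` and `parts=4`, coincides with the median of 4.865.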


LOCATION OF THE MODE


The mode is the value of the x-variable corresponding to 
the maximum ordinate of a given frequency curve. The
concept of a modal value is a thoroughly easy one to grasp.
It is the most common wage, the most common income, 
the most common height. It is the point where the con- 
centration is greatest, a characteristic which is effectively 
brought out by Fechner’s term for this average, dichtester 
wert, or thickest value. It is not so easy, however, to locate 
the true modal value in a given case. In general statistical 
work an approximate value only is secured for the mode, 
but for most practical purposes this value is usually suf-
ficiently accurate.¹

The method of determining this approximate modal 
value may be illustrated by reference to the following 
distribution: 


TABLE 26

Frequency Distribution of Five Per Cent Bonds

(The table is based upon quotations on the New York Stock Exchange on
Oct. 31, 1923, on railroad and industrial bonds with coupon rate of 5 per cent)

Quoted Price          Mid-point    Frequency
Class-interval            m            f
Less than 71.5                         30
 71.5- 73.49            72.5            2
 73.5- 75.49            74.5            5
 75.5- 77.49            76.5            5
 77.5- 79.49            78.5            5
 79.5- 81.49            80.5           12
 81.5- 83.49            82.5           13
 83.5- 85.49            84.5           15
 85.5- 87.49            86.5           11
 87.5- 89.49            88.5           17
 89.5- 91.49            90.5           17
 91.5- 93.49            92.5           33
 93.5- 95.49            94.5           45
 95.5- 97.49            96.5           51
 97.5- 99.49            98.5           59
 99.5-101.49           100.5           23
101.5-103.49           102.5            9
103.5-105.49           104.5            1
                                      353


1 A method of locating the mode more accurately is explained in the note 
appended to Chapter XV. 


There is a wide dispersion of the 30 cases falling below 
71.5, and the existence of this “open-end” class makes it
impossible to compute the value of the mean, as the table 
stands. The mode is, therefore, an appropriate average to 
employ in the present instance. 

The class having limits of 97.5 — 99.49 contains the 
greatest number of cases. This appears to be the modal 
group, and the mid-point of this class, 98.5, may be tenta- 
tively accepted as the value of the approximate mode. 
But with different classifications quite different values 
might be secured for the mode. When the original bond 
quotations are tabulated with varying class-intervals, the 
following results are secured. (Only the frequencies of the 
central classes are shown. It is not necessary, for this pur- 
pose, to present each of the tables as a whole.) 


         (a)                         (b)                         (c)
 Class-interval = 1          Class-interval = 2          Class-interval = 3
 Class-interval     f        Class-interval     f        Class-interval     f

  93.5- 94.49      17         94.5- 96.49      54         91.5- 94.49      50
  94.5- 95.49      28         96.5- 98.49      50         94.5- 97.49      79
  95.5- 96.49      26         98.5-100.49      52         97.5-100.49      77
  96.5- 97.49      25
  97.5- 98.49      25
  98.5- 99.49      34
  99.5-100.49      18
 100.5-101.49       5


With a class-interval of 1 a value of 99 is secured for the 
mode; with a class-interval of 2, but with different class 
limits than in Table 26, a value of 95.5 is obtained. With
a class-interval of 3 a modal value of 96 is secured. Further 
changes in classification would give still other values. The 
mode thus appears to be a curiously intangible and shifting 
average. Its value, for the same data, seems to vary with 
changes in the size of the class-interval and in the location 
of the class-limits. 
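The shifting of the approximate mode may be reproduced from the unit-interval frequencies of tabulation (a) alone. The Python sketch below is illustrative only: the function name `crude_mode` is an assumption, and since only the central classes are available, the regrouping considers only classes that can be completely formed from them.

```python
# Frequencies of the unit classes in tabulation (a), keyed by lower class limit.
unit_classes = {93.5: 17, 94.5: 28, 95.5: 26, 96.5: 25,
                97.5: 25, 98.5: 34, 99.5: 18, 100.5: 5}

def crude_mode(unit_classes, start, width):
    """Regroup unit classes into classes of the given width, beginning at `start`.

    Returns (mid-value of the fullest class, its frequency). Only classes that
    can be formed completely from the available unit classes are considered.
    """
    best_lower, best_total = None, -1
    lower = start
    while all(lower + k in unit_classes for k in range(width)):
        total = sum(unit_classes[lower + k] for k in range(width))
        if total > best_total:
            best_lower, best_total = lower, total
        lower += width
    if best_lower is None:
        raise ValueError("no complete class can be formed")
    return best_lower + width / 2, best_total

print(crude_mode(unit_classes, 93.5, 1))  # (99.0, 34)
print(crude_mode(unit_classes, 93.5, 2))  # (98.5, 59), the limits of Table 26
print(crude_mode(unit_classes, 94.5, 2))  # (95.5, 54)
print(crude_mode(unit_classes, 94.5, 3))  # (96.0, 79)
```

The same eight frequencies thus yield four different values for the approximate mode, according to the width and limits of the classes adopted.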

These difficulties arise primarily from limitations to the 
size of the sample being studied. The true mode, that
value which would occur the greatest number of times in
an infinitely large sample, could be located exactly if we 
could increase indefinitely the number of cases included. 
For, given sufficient cases, the approximate mode approaches 
the true mode as the class-interval decreases. Grouping in 
large classes obscures details, and as these classes are re- 
duced in size more of the details are seen and a truer picture 
of the actual distribution is secured. But since most 
practical work is necessarily based upon relatively small 
samples, the increase in the number of classes reveals gaps 
and irregularities, and causes such a loss of symmetry and 
order that doubt arises as to where the point of greatest 
concentration really lies. The different tabulations of bond 
prices furnish an excellent example of this. 

By mathematical methods it is possible to obtain a value 
for the true mode without securing an infinite number of 
cases. The smoothing process has been briefly explained. 
One sort of smoothing is that which involves the fitting of 
an appropriate type of ideal frequency curve to the data 
of a given frequency distribution. This gives, theoretically, 
the distribution which would be secured by the process 
first indicated, that of decreasing indefinitely the size of 
the class-interval and increasing indefinitely the number of 
cases. The value of the x-variable corresponding to the
maximum ordinate of this ideal fitted curve is the true 
mode. 

For most practical purposes approximate values of the 
mode are adequate, and these may be secured by much 
simpler methods. A first and rough approximation may 
be obtained by taking the mid-value of the class of greatest 
frequency, a method suggested above. If the general rules 
for classification which were outlined in an earlier section 
have been followed, this procedure will not generally involve 
a gross error.

It is possible, given a fairly regular distribution, to 
secure, by a process of interpolation within the modal
group, a closer approximation than is obtained by accept-
ing the mid-value of this group as the mode. Referring
again to the tabulation of bond prices in Table 26 it will 
be noted that the distribution on the two sides of the 
modal class is not symmetrical. The modal class is that 
with a mid-value of 98.5. The class next below, with a 
mid-value of 96.5, contains 51 cases, while that next above, 
with a mid-value of 100.5, contains but 23 cases. The 
disproportion is continued in the succeeding classes below 
and above, more cases being bulked below the modal class 
than above. For other purposes we have assumed an even 
distribution of cases between the upper and lower limits of 
each class, but it is probable that this is not true of the 
modal class in the present case. Judging from the distri- 
bution outside this class, it is likely that the concentration 
is greatest in the lower half of the class-interval, that is, 
between 97.5 and 98.5. The mode, therefore, probably lies 
below the mid-value, 98.5, rather than precisely at that 
point. We may attempt to locate it within the group by 
weighting, assuming a pull toward the lower end of the 
scale equal to 51 (the number in the class next below) and 
a pull toward the upper end of the scale equal to 23 (the 
number in the class next above). This may be expressed 
by a formula, employing the following symbols: 


$L$ = lower limit of modal class.
$f_1$ = frequency of class next below modal class in value.
$f_2$ = frequency of class next above modal class in value.
$i$ = class-interval.


The interpolation formula is

$$Mo = L + \frac{f_2}{f_1 + f_2} \times i$$

Applying this formula to the bond price data presented in
Table 26, we have

$$Mo = 97.5 + \frac{23}{51 + 23} \times 2 = 97.5 + .62 = 98.12$$


A closer approximation may sometimes be secured by bas-
ing the weights (represented by $f_2$ and $f_1$) upon the total
frequencies of the two or three classes next above the 
modal class and the same number below. If three classes 
on each side are included in the present case, a value of 
97.91 is secured for the mode of bond prices. 
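The interpolation within the modal group may be sketched in Python. The function name `interpolated_mode` and its parameter names are assumptions introduced for illustration; the frequencies are those of the bond price distribution, with one case in the highest class (the figure consistent with the total of 353).

```python
def interpolated_mode(lower_limit, weight_below, weight_above, class_interval):
    """Locate the mode within the modal class by weighting.

    weight_below and weight_above are the frequencies (or summed frequencies)
    of the classes next below and next above the modal class.
    """
    return lower_limit + weight_above / (weight_below + weight_above) * class_interval

# One class on each side of the modal class 97.5-99.49.
mode_adjacent = interpolated_mode(97.5, 51, 23, 2)
# Three classes on each side: 51 + 45 + 33 below, 23 + 9 + 1 above.
mode_three = interpolated_mode(97.5, 51 + 45 + 33, 23 + 9 + 1, 2)

print(round(mode_adjacent, 2), round(mode_three, 2))  # 98.12 97.91
```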

In some cases the problem of locating the mode is com- 
plicated by the existence of several points of concentration, 
rather than the single point which has been assumed in 
the preceding explanation. Thus in Table 9, representing 
the distribution of wages, with a class-interval of 25 cents, 
there are two definite modal points. A distribution of this 
type is called bi-modal; when plotted, a frequency curve 
having two humps is obtained. If the data are homogene- 
ous such a distribution is the result of paucity of data and 
of the method of classification employed. It may be due 
to the use of a class-interval too small, with respect to the 
number of cases included in the sample. An approximate 
mode may be determined in such cases by shifting the 
class-limits and increasing the class-interval, carrying on 
this process until one modal group is definitely established. 
This reverses the process by which the true mode may be 
located when the number of cases is infinitely large. Under 
such conditions the class-interval might be reduced until 
it was infinitely small. But with a limited number of cases 
the location of the point where the concentration is greatest 
necessitates increasing the size of the class-interval, in order 
to get away from the irregularities due to the smallness of 
the sample. 

If the distribution remains bi-modal in spite of changes 
in the class-intervals and class-limits, it is probable that 
the data are not homogeneous, that two different distri- 
butions have by mistake been combined. Such cases are 
not uncommon in biometrical work. The existence of two 
distinct animal species where only one was suspected has 
been revealed in this way. The whole significance of a
frequency distribution will be lost if the data are not
homogeneous, a fact which is as true of work in the field 
of economic statistics as in any other. 


DETERMINATION OF THE MODAL VALUE FROM MEAN
AND MEDIAN


Another method of securing an approximate value for 
the mode, a method based upon the relationship between 
the values of the mean, median and mode, may be em- 
ployed in certain cases. In a perfectly symmetrical distri- 
bution mean, median and mode coincide. As the distribu- 
tion departs from symmetry these three points on the scale 
are pulled apart. If the degree of asymmetry is only 
moderate the three points have a fairly constant relation. 
The mode and mean lie farthest apart, with the median 
one third of the distance from the mean towards the mode. 
If the asymmetry is marked, no such relationship may pre- 
vail. Having the values of any two of the averages in a 
moderately asymmetrical frequency distribution, therefore, 
the other may be approximated. In fact, however, the 
method should only be employed in determining the value 
of the mode, as the other two values may be computed 
more accurately by other methods. The value of the mode 
itself should only be determined in this way when more 
exact methods are not applicable or are not called for. 

The following formula is based upon this relationship: 


\[ Mo = \text{Mean} - 3(\text{Mean} - Md) \]

Applying this formula to the telephone pole data shown 
in Table 12, the following result is secured: 

\[ Mo = 9.33 - 3(9.33 - 9.015) = 8.385 \]

This value is slightly below the mid-value of the modal 
class, 8.5, and is also less than the value of 8.49 which is 
secured by weighting within the modal group (using four 
classes on each side). 
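
The arithmetic of the formula may be checked mechanically. The following is a minimal sketch in Python; the function name is an assumption introduced here for illustration, and the figures are those quoted above for the telephone pole data. It is appropriate only for moderately asymmetrical distributions.

```python
def mode_from_mean_and_median(mean: float, median: float) -> float:
    """Approximate the mode from Mo = Mean - 3(Mean - Md).

    Hypothetical helper; valid only when the asymmetry is moderate.
    """
    return mean - 3.0 * (mean - median)


# Telephone pole data quoted in the text: mean 9.33, median 9.015.
approx_mode = mode_from_mean_and_median(9.33, 9.015)
print(round(approx_mode, 3))
```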




It must be emphasized that there is a fictitious accuracy 
to all these values for the mode. All the methods of locat- 
ing the mode which have been discussed are merely ap- 
proximative, a fact which must not be forgotten in inter- 
preting and utilizing the results. 


GRAPHIC LOCATION OF MODE, MEDIAN, QUARTILES AND 
DECILES 


A better understanding of the frequency curve and of 
the cumulative frequency curve may be secured through a 
brief discussion of certain methods of locating graphically 
some of the statistical measures which have been described. 

The value of the mode may be readily determined from 
a frequency curve of the usual type, for, by definition, the 
mode is the reading on the horizontal scale corresponding 
to the maximum ordinate of such a curve. If this reading 
be taken from the frequency polygon a rough value will be 
obtained, the mid-value of the class of greatest frequency. 
A closer approximation to the true value of the mode will 
be secured from a curve which has been smoothed, either 
by inspection or by mathematical methods. Figure 45, 
showing a curve (smoothed by inspection) based upon the 
wage data presented in Table 8, indicates how the mode 
may be located graphically. The horizontal reading corre- 
sponding to the maximum ordinate of this curve is $27.50, 
an approximate value of the mode which may be compared 
with the values of $27.69 secured by the weighting process 
and of $27.3470 secured from the values of the mean and 
median. 

The locations of the median and mean have been indi- 
cated on this curve. It has been pointed out that in mod- 
erately asymmetrical (or skewed) distributions there tends 
to be a constant relationship between the three averages 
which have been described, the median lying between the 
mean and the mode, and approximately one third of the 
distance from the former towards the latter. In the present 
case this relationship holds fairly well when the value of 
the mode is approximated from the smoothed curve. The 
irregularities in the original data render the process of 
smoothing by inspection rather arbitrary, however. 


Fig. 45. Distribution of Weekly Earnings of Employees. A Smoothed Fre- 
quency Curve, Showing the Relation between Mean, Median and Mode 
(Vertical scale: Frequency.) 


In Fig. 46 the same data are represented by a cumulative 
frequency curve, based upon Table 27. The steepness of 
a cumulative frequency curve within any given interval 
depends upon the number of cases added within the cor- 
responding interval on the horizontal scale. Thus the 
curve rises gradually at first, then more steeply, and tails 
off gradually at the upper extremity. The value of the 
mode, obviously, is the reading on the horizontal scale 
corresponding to the point of greatest steepness. This 
is the point at which the increase of frequencies is great- 
est, the point of greatest concentration in the frequency 
distribution. The value of the mode may be approxi- 
mated from a smoothed cumulative frequency curve by 
locating the point at which the slope is greatest (which is 
a point of inflection) and taking the corresponding reading 
on the x-scale. In the present case a value of approximately 
$27.50 is secured for the mode by this method. 


TABLE 27 
Cumulative Distribution of Wage-earners in a Manufacturing 
Establishment 
(Classified on the Basis of Weekly Earnings.) 

Weekly earnings        Number earning stated amount 
                       (frequency) 
Less than $22.50              0 
Less than  23.00              1 
Less than  23.50              5 
Less than  24.00              8 
Less than  24.50             19 
Less than  25.00             29 
Less than  25.50             41 
Less than  26.00             56 
Less than  26.50             78 
Less than  27.00             98 
Less than  27.50            122 
Less than  28.00            152 
Less than  28.50            169 
Less than  29.00            186 
Less than  29.50            193 
Less than  30.00            199 
Less than  30.50            204 
Less than  31.00            208 
Less than  31.50            209 
Less than  32.00            209 
Less than  32.50            210 
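
The point of greatest steepness may be located crudely from the table itself, before any smoothing. The sketch below, in Python, takes the rise of the cumulative figures over each half-dollar interval and picks the interval of greatest rise. The variable names are assumptions made for illustration. Being unsmoothed, the result is only the modal class, not the refined reading of about $27.50 obtained from the smoothed curve.

```python
# Table 27: upper class limits (dollars) and the number earning less than each limit.
limits = [22.50 + 0.50 * i for i in range(21)]
cumulative = [0, 1, 5, 8, 19, 29, 41, 56, 78, 98, 122, 152,
              169, 186, 193, 199, 204, 208, 209, 209, 210]

# The rise of the cumulative curve over each interval equals that class's frequency.
rises = [upper - lower for lower, upper in zip(cumulative, cumulative[1:])]

# The interval with the greatest rise is the steepest part of the unsmoothed curve.
steepest = max(range(len(rises)), key=rises.__getitem__)
modal_class = (limits[steepest], limits[steepest + 1])

print("steepest interval:", modal_class, "cases added:", rises[steepest])
```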


Values for the median, quartiles and deciles may also be 
secured graphically from the cumulative frequency curve. 
The smoothing of such a curve provides a quite satis- 
factory method of interpolation and, if the scale of the 
diagram is sufficiently large, accurate values may be ob- 
tained by this method. Locate on the vertical scale (the 
scale of cumulative frequencies) a point distant from the 
base by N/2. If from this point a horizontal line be ex- 
tended to the cumulative curve, the abscissa of the point 
of intersection will be the value of the median. This value 
may be easily determined by dropping a vertical line from 
the point of intersection to the x-axis. Figure 46 illustrates 
the application of this method. A value of $27.125 is 
secured for the median graphically. By direct inter- 
polation a value of $27.1458 is obtained. The quartiles 
may be located in precisely the same way, the vertical scale 
being divided into quarters and horizontal lines extended to 
the cumulative curve from the points thus located on the 
vertical scale. 

Fig. 46. Cumulative Distribution of Weekly Earnings of Employees, 
Illustrating the Graphic Location of Median and Quartiles 
(Vertical scale: cumulative frequency, 0 to 200. Horizontal scale: Dollars.) 
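
The direct interpolation mentioned above may be sketched as follows, in Python. The function name and its arguments are assumptions made for illustration; the figures are those of Table 27. The method replaces the smoothed curve by straight lines between tabulated points, so the median comes out at the interpolated figure of about $27.1458 rather than the graphic reading of $27.125.

```python
def quantile_from_cumulative(limits, cumulative, fraction):
    """Value below which `fraction` of the cases fall (0 < fraction <= 1).

    Hypothetical helper: straight-line interpolation between the points
    of a cumulative ("less than") frequency table.
    """
    target = fraction * cumulative[-1]
    for i in range(1, len(limits)):
        if cumulative[i] >= target:
            lower_x, upper_x = limits[i - 1], limits[i]
            lower_c, upper_c = cumulative[i - 1], cumulative[i]
            share = (target - lower_c) / (upper_c - lower_c)
            return lower_x + share * (upper_x - lower_x)
    raise ValueError("fraction must lie between 0 and 1")


# Table 27: upper class limits and cumulative frequencies.
limits = [22.50 + 0.50 * i for i in range(21)]
cumulative = [0, 1, 5, 8, 19, 29, 41, 56, 78, 98, 122, 152,
              169, 186, 193, 199, 204, 208, 209, 209, 210]

median = quantile_from_cumulative(limits, cumulative, 0.50)
lower_quartile = quantile_from_cumulative(limits, cumulative, 0.25)
upper_quartile = quantile_from_cumulative(limits, cumulative, 0.75)
print(round(lower_quartile, 4), round(median, 4), round(upper_quartile, 4))
```

Deciles follow from the same function with fractions of 0.1, 0.2, and so on.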


For some purposes, particularly those which involve the 
averaging of rates or ratios rather than quantities, none of 
the averages which have been described is suitable. The 
geometric and the harmonic means are types of averages 
which should be familiar because they are particularly 
appropriate for such purposes. 


THE GEOMETRIC MEAN 


The geometric mean is the nth root of the product of n 
measures; its value thus is represented by: 

\[ M_g = \sqrt[n]{a_1 a_2 a_3 \cdots a_n} \]

The geometric mean of the numbers 2, 4, 8, is 

\[ M_g = \sqrt[3]{2 \times 4 \times 8} = 4 \]


It is obvious from the method of computation that if 
any of the measures in the series has a value of zero the 
geometric mean is zero. 

The actual computation of the geometric mean is greatly 
facilitated by the use of logarithms. In this form 

\[ \log M_g = \frac{\log a_1 + \log a_2 + \log a_3 + \cdots + \log a_n}{n} \]

The logarithm of the geometric mean is equal to the arith- 
metic mean of the logarithms of the individual measures. 
When the measures, of which the geometric mean is de- 
sired, are to be weighted, the separate weights are intro- 
duced as exponents of the terms to which they apply. Thus 
if we represent the sum of the weights by N and the weights 
corresponding to the terms \(a_1, a_2, a_3, \ldots, a_n\), respectively, 
by \(w_1, w_2, w_3, \ldots, w_n\), the formula for the geometric mean is 

\[ M_g = \sqrt[N]{a_1^{w_1} a_2^{w_2} a_3^{w_3} \cdots a_n^{w_n}} \]

This is equivalent to repeating each term a number of 
times, the number corresponding to the amount by which 
it is weighted, which, of course, is precisely what is done 
in securing a weighted arithmetic mean. When logarithms 
are employed the formula for the weighted mean becomes 

\[ \log M_g = \frac{w_1 \log a_1 + w_2 \log a_2 + w_3 \log a_3 + \cdots + w_n \log a_n}{N} \]


The method of computing the geometric mean may be 
illustrated with reference to the following table, which 
shows the distribution of the prices of 115 dividend-paying 
preferred stocks listed on the New York Stock Exchange. 
The table is based upon closing prices on August 26, 1922. 


TABLE 28 
Computation of the Geometric Mean of Preferred Stock Prices 

Class-interval       m       f      log m        f log m 
$ 35-$ 44.9        $ 40      1    1.602060      1.602060 
  45-  54.9          50      6    1.698970     10.193820 
  55-  64.9          60      8    1.778151     14.225208 
  65-  74.9          70      5    1.845098      9.225490 
  75-  84.9          80     14    1.903090     26.643260 
  85-  94.9          90     22    1.954243     42.993346 
  95- 104.9         100     27    2.000000     54.000000 
 105- 114.9         110     18    2.041393     36.745074 
 115- 124.9         120     14    2.079181     29.108534 
                           115                224.736792 

\[ \log M_g = \frac{224.736792}{115} = 1.954233 \]

\[ M_g = \$90.00 \]
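
The computation of Table 28 may be reproduced with a few lines of Python. This is a sketch under the same assumption as the table, namely that every stock in a class is priced at the mid-value of that class; the variable names are introduced here for illustration. Common logarithms are used to parallel the table.

```python
import math

# Table 28: class mid-values (dollars) and frequencies, used as weights.
midpoints = [40, 50, 60, 70, 80, 90, 100, 110, 120]
frequencies = [1, 6, 8, 5, 14, 22, 27, 18, 14]

n_total = sum(frequencies)                       # N, the sum of the weights
sum_f_log = sum(f * math.log10(m)                # sum of f log m
                for m, f in zip(midpoints, frequencies))
log_gm = sum_f_log / n_total                     # log of the geometric mean
geometric_mean = 10 ** log_gm

print(n_total, round(sum_f_log, 4), round(log_gm, 6), round(geometric_mean, 2))
```

Small differences in the last decimal places, as compared with the table, arise because the table works from six-place logarithms.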


CHARACTERISTICS OF THE GEOMETRIC MEAN 


The nature of the geometric mean may be understood 
by considering its relation to the terms it represents, as an 
average. 

If the arithmetic mean of a series of measures replace 
each item in the series, the sum of the measures will remain 
unchanged. Thus, the sum of the numbers 2, 4, 8, is 14. 
The arithmetic mean of these three numbers is 4 2/3; if this 
value be inserted in the place of each of the three measures 
the sum remains 14. It is characteristic of the geometric 
mean that the product of a series of measures will remain 
unchanged if the geometric mean of those measures replace 
each item in the series. Thus the product of 2, 4, 8, is 64. 
The geometric mean of the three numbers is 4; if this 
value replace each of the three measures the product 
remains 64. 

Again, it is true of the arithmetic mean that the sum of 
the deviations of the items above the mean equals the 
sum of the deviations of the items below the mean (disre- 
garding signs). The sums of the differences between the 
individual items and the mean are equal. In the case of 
the geometric mean the products of the corresponding 
ratios are equal. If the ratios of the geometric mean to the 
measures which it exceeds be multiplied together, the 
product will equal that secured by multiplying together 
the ratios to the geometric mean of the measures exceeding 
it in value. For example, the geometric mean of the 
numbers 3, 6, 8, 9, is 6. The following equation may be 
set up: 

\[ \frac{6}{3} \times \frac{6}{6} = \frac{8}{6} \times \frac{9}{6} \]
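
Both properties may be verified numerically for the series 3, 6, 8, 9. The sketch below is in Python; the helper `geometric_mean` is an assumption written out here for illustration, and requires positive values.

```python
import math


def geometric_mean(values):
    """nth root of the product, computed through logarithms (positive values only)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


values = [3, 6, 8, 9]
gm = geometric_mean(values)

# Property 1: replacing every item by the geometric mean leaves the product unchanged.
product_original = math.prod(values)
product_replaced = gm ** len(values)

# Property 2: the ratios of the mean to the items it exceeds balance
# the ratios to the mean of the items exceeding it.
ratios_below = math.prod(gm / v for v in values if v < gm)
ratios_above = math.prod(v / gm for v in values if v > gm)

print(gm, product_original, product_replaced, ratios_below, ratios_above)
```

An item exactly equal to the mean contributes a ratio of unity and may be placed on either side of the equation, as the term 6/6 shows.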

The last example brings out the most important charac- 
teristic of the geometric mean. It is a means of averaging 
ratios. Its chief use in the field of economic statistics has 
been in connection with index numbers of prices, where 
rates of change are of major importance. A rise in prices 
represented by the change from 50 to 100 is as important 
as a rise from 100 to 200. Yet this equivalence is not 
brought out by the arithmetic mean, which gives double 
weight to the change which involves an absolute difference 
of 100. An example frequently cited is that of two cases 
of price change, one a ten-fold increase, from 100 to 1000, 
the other a fall to one-tenth of the old price, from 100 to 
10. The arithmetic mean of 1000 and 10 is 505, the geo- 
metric mean is \(\sqrt{1000 \times 10}\), or 100. When the average is 
of the latter type it is seen that the two equal ratios of 
change have balanced each other. The arithmetic mean, 
505, is quite incorrect as a measure of average ratio of 
price change. This subject is discussed at greater length 
in the chapter on index numbers. 
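
The contrast may be set out in a short Python sketch, using the two prices of the example (both starting from 100). The variable names are assumptions made for illustration.

```python
import math

base_price = 100
new_prices = [1000, 10]   # a ten-fold rise and a fall to one-tenth

arithmetic = sum(new_prices) / len(new_prices)
geometric = math.sqrt(new_prices[0] * new_prices[1])

# Expressed as ratios to the base, the two changes are 10 and 1/10.
print("arithmetic mean:", arithmetic, "ratio to base:", arithmetic / base_price)
print("geometric mean: ", geometric, "ratio to base:", geometric / base_price)
```

The geometric mean returns the base price, showing that the two equal ratios of change offset each other; the arithmetic mean suggests a five-fold average rise.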

What has been said in an earlier section in regard to the 
advantages of logarithmic charting for certain purposes 
bears upon the use of the geometric mean. This average 
is sometimes called the logarithmic mean, as its logarithm 
is simply the arithmetic mean of the logarithms of the 
constituent measures. Wherever percentages of change are 
being averaged, where ratios rather than absolute differ- 
ences are significant, the use of the geometric mean is 
advisable. 

A problem involving the use of the geometric mean arises 
in computing the average rate of increase of any sum at 
compound interest. If \(p_0\) represent the principal at the 
beginning of the period, \(p_n\) the principal at the end of the 
period, r the rate of interest and n the number of years 
in the period, the sum to which \(p_0\) will amount at the end 
of the n years, if interest is compounded annually, is repre- 
sented by the equation: 

\[ p_n = p_0 (1 + r)^n \]

It follows from this that: 

\[ r = \sqrt[n]{\frac{p_n}{p_0}} - 1 \]

Thus, if $1000 at compound interest amounts to $1600 
at the end of 12 years, there has been an increase of 60%. 
The arithmetic mean is 5%, but this is not the rate at 
which the money increased. The true rate is: 

\[ r = \sqrt[12]{\frac{1600}{1000}} - 1 \approx 1.04 - 1 = .04, \text{ or } 4\% \]



Precisely the same problem arises whenever rates of in- 
crease or decrease are to be averaged. The use of the 
arithmetic mean gives an incorrect result. 
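
The compound interest example may be worked as follows. This is a minimal Python sketch; the function name is an assumption made for illustration, and annual compounding is assumed as in the text.

```python
def average_compound_rate(p0: float, pn: float, years: int) -> float:
    """Solve p_n = p_0 (1 + r)^n for r (annual compounding assumed)."""
    return (pn / p0) ** (1.0 / years) - 1.0


true_rate = average_compound_rate(1000, 1600, 12)
simple_rate = (1600 / 1000 - 1) / 12    # the 60% increase spread evenly over 12 years

print(f"compound rate: {true_rate:.4f}")
print(f"arithmetic mean of the increase: {simple_rate:.4f}")

# Check: compounding at the true rate reproduces the final principal.
final_principal = 1000 * (1 + true_rate) ** 12
```

The compound rate comes out just under 4%, against the 5% obtained by simple division.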


THE GEOMETRIC MEAN AS A MEASURE OF CENTRAL 
TENDENCY 


A question arises as to the type of frequency distribution 
the central tendency of which would be best represented 
by the geometric mean. This question has given rise to 
some interesting discussion. When the absolute measures, 
plotted on the arithmetic scale, give a fairly symmetrical 
distribution, the arithmetic mean is clearly preferable to the 
geometric mean. But when the absolute figures thus 
plotted give an asymmetrical frequency curve of such a 
type that the asymmetry would be removed and a sym- 
metrical curve secured by plotting the logarithms of the 
measures, the geometric mean would appear to be prefer- 
able. Such a distribution would be one in which not the 
absolute deviations about the central tendency but the 
relative deviations, the deviations as ratios, were symmetri- 
cal. The arithmetic mean of the logarithms of the various 
measures (which value is, as has been shown, the logarithm 
of the geometric mean of the original measures) would be 
the best representative of the central tendency in such a 
distribution. The curve thus plotted would be symmetrical 
about the logarithm of the geometric mean. A frequency 
curve representing the logarithms of percentage changes in 
prices would tend to show this symmetry about the loga- 
rithm of the geometric mean of these changes. These 
percentage changes, as natural numbers, group themselves 
in an asymmetrical form, with the range of deviations 
above the arithmetic mean greatly exceeding the range 
below.1 This arises, of course, from the fact that prices of 
given commodities may increase 1,000 per cent or more 
from a given base, but cannot fall more than 100% from 
any given base. The section on index numbers contains 
a fuller discussion of this particular phase of the subject.2 

1 Cf. Figures 49 and 50. 

2 C. M. Walsh, in The Problem of Estimation, 35, lays down the following 
criteria for the use of averages: 

(a) When there are no conceivable or assignable upper or lower limits to the 
values of the terms in a series, the arithmetic average should be employed. 

(b) When there is a definite lower limit at or above zero and no upper conceivable 
or assignable limit, the geometric average should be employed. Because 
this is true of price changes Walsh believes the geometric average to be 
the correct one to use in making index numbers of prices. 

(c) When in practice, or in the nature of things, certain upper and lower limits 
are found to exist and the above criteria cannot be employed, a study of 
the actual dispersion of the data is necessary. In this case, if the mode is 
found nearer to the arithmetic average, that average should be employed; 
if the mode is found nearer to the geometric average, that average should 
be used. 

A frequency curve of this type, based upon the loga- 
rithms of the measures included rather than upon the 
natural numbers, has been employed to advantage in plot- 
ting data relating to income distribution. When natural 
numbers are plotted, the range of income distribution is so 
large that it is physically impossible to prepare a chart 
which will reveal the characteristic features of all sections 
of the curve. The process of plotting on double logarithmic 
paper (which is, of course, equivalent to plotting the loga- 
rithms of both x’s and y’s) meets this difficulty, giving a 
true impression of the whole distribution and the relations 
between its parts, and, at the same time, brings out certain 
important features which are obscured in the natural scale 
chart. In particular, this device appears to smooth that 
part of the curve above the mode into a straight line, a 
fact which led Vilfredo Pareto, the first person to employ 
this method of representing income data, to enunciate what 
has been known as Pareto’s Law concerning income distri- 
bution. An intensive study of the distribution of income 
in the United States has led the staff of the National 
Bureau of Economic Research to call into question certain 
conclusions drawn from Pareto’s generalizations, though the 
value of the double logarithmic scale for the presentation 
of income data has been recognized. An interesting dis- 
cussion of this type of frequency curve is found in the 
reports on Income in the United States, prepared by the 
National Bureau of Economic Research. 


THE HARMONIC MEAN 


The harmonic mean is a type of average capable of 
application only within a restricted field, but which should 
be employed to avoid error in handling certain types of 
data. It must be used in the averaging of time rates and 
it has certain distinctive advantages in the manipulation 
of certain other materials. The following example will 
illustrate the method of employing this average: 

An automobile is driven four miles at the rate of twenty 
miles per hour and four miles at the rate of thirty miles 
per hour; the average rate of speed is required. The 
arithmetic mean of the two rates is 25 miles per hour, but 
this is an incorrect solution. The machine is driven 12 
minutes at the rate of 20 miles per hour and 8 minutes at 
the rate of 30 miles. Thus in 20 minutes 8 miles are covered, 
the average rate being 24 miles per hour. This is equivalent 
to taking a weighted average of 20 and 30, the former 
weighted by 12 and the latter by 8. The same result may 
be secured by taking the harmonic mean of the two rates 
directly. The harmonic mean of a series of measures is the 
reciprocal of the arithmetic mean of the reciprocals of the 
individual measures. Thus if we represent a series of rates 
to be averaged by \(r_1, r_2, \ldots, r_n\), the formula for the har- 
monic mean, H, is 

\[ \frac{1}{H} = \frac{\frac{1}{r_1} + \frac{1}{r_2} + \cdots + \frac{1}{r_n}}{n} \]

Using the figures just quoted: 

\[ \frac{1}{H} = \frac{\frac{1}{20} + \frac{1}{30}}{2} = \frac{5}{120}, \qquad H = 24 \]

When the harmonic mean of two numbers, a and b, is 
required, it is simpler to work from the form \(H = \frac{2ab}{a + b}\); 
for three numbers, a, b and c, the form \(H = \frac{3abc}{ab + ac + bc}\) 
may be used. These are equivalent to the first formula 
given. The computation of the harmonic mean of a series 
of magnitudes is greatly facilitated by the use of prepared 
tables of reciprocals.1 
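
The automobile example may be checked in Python, whose standard `statistics` module provides `harmonic_mean`. The sketch compares three routes to the same figure: the definition, the direct reckoning of distance over time, and the two-number short form. The variable names are assumptions made for illustration.

```python
from statistics import harmonic_mean

speeds = [20, 30]            # miles per hour, each held for four miles

# By definition: reciprocal of the arithmetic mean of the reciprocals.
h = harmonic_mean(speeds)

# Direct reckoning: total distance divided by total time.
total_miles = 4 + 4
total_hours = 4 / 20 + 4 / 30
direct_rate = total_miles / total_hours

# Short form for two numbers: 2ab / (a + b).
a, b = speeds
short_form = 2 * a * b / (a + b)

print(h, direct_rate, short_form)
```

The agreement depends on equal distances being covered at each rate; were equal times spent at each rate, the arithmetic mean would be the appropriate average.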

The use of the harmonic mean in handling economic data 
may be illustrated by an example from the price field. If 
a given commodity is quoted at “so many per dollar,” and 
it is desired to average several such quotations, the use of 
the simple arithmetic mean may give erroneous results. 
Given the quotations “four for a dollar,” “five for a dollar” 
and “twenty for a dollar,” the average number per dollar 
(i.e. the average price) is required. The arithmetic average 
of the figures as given (4, 5 and 20) is 9 2/3, which would 
appear to be the average number sold per dollar, an average 
price of 10.34 cents apiece. But the original quotations 
are equivalent to prices of 25¢ apiece, 20¢ apiece and 5¢ 
apiece; the arithmetic average of these prices is 16 2/3¢, 
which, expressed in the original form, gives an average of 
6 per dollar. Assuming that it is desired to give equal 
weight to each quotation, the latter is the correct figure. 
The arithmetic mean of the quotations in the “so many 
per dollar” form is a weighted average, greater weight being 
given to quotations involving a large number of commodity 
units. The correct result may be secured from the original 
quotations, however, by the use of the harmonic mean. 
The harmonic mean of 4, 5, and 20, is 6, the average number 
per dollar. 

1 Barlow's Tables of Squares, Cubes, Square Roots, Cube Roots, Reciprocals 
(N. Y., Spon and Chamberlain, 1919) constitute a very useful compilation. 
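
The price quotations may be treated in the same manner. The following Python sketch, with variable names assumed for illustration, sets the arithmetic mean of the quotations beside the figure obtained by averaging the prices, and beside the harmonic mean of the quotations taken directly.

```python
from statistics import harmonic_mean, mean

per_dollar = [4, 5, 20]                      # "so many for a dollar"

# Arithmetic mean of the quotations as given.
naive_per_dollar = mean(per_dollar)          # 9 2/3
naive_price_cents = 100 / naive_per_dollar   # about 10.34 cents apiece

# Convert to prices apiece, average, and convert back.
prices_cents = [100 / q for q in per_dollar]     # 25, 20 and 5 cents
average_price_cents = mean(prices_cents)         # 16 2/3 cents
correct_per_dollar = 100 / average_price_cents

# The harmonic mean reaches the same figure from the original quotations.
h = harmonic_mean(per_dollar)

print(naive_per_dollar, round(naive_price_cents, 2), correct_per_dollar, h)
```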


RELATIONS BETWEEN DIFFERENT AVERAGES 


When different averages are located or computed for a 
given series of magnitudes, certain relationships between 
them are found to prevail. 


1. The arithmetic mean, the median and the mode coincide 
in a symmetrical distribution. 

2. In a moderately asymmetrical distribution the median lies 
between the mean and the mode, approximately one 
third of the distance along the scale from the former 
towards the latter. Hence, for this type of distribution 
there is an approximation to the following relationship: 

\[ Mo = M - 3(M - Md) \]

3. The arithmetic mean of any series of magnitudes is greater 
than their geometric mean. 

4. The geometric mean of any series of magnitudes is greater 
than their harmonic mean. The only exception to the 
last two rules is found when all the measures in the series 
are equal, in which case arithmetic mean, geometric 
mean and harmonic mean are equal. 

5. The geometric mean of any two terms is equal to the 
geometric mean of the harmonic and arithmetic means of 
those terms. Thus if the terms be 2 and 8, the harmonic 
mean is 3 1/5, the geometric mean 4, and the arithmetic 
mean 5. But 4 is also the geometric mean of 3 1/5 and 5. 
This relationship does not hold when the series includes 
more than two terms. 

6. When the dispersion of data follows the arithmetic law, the 
mode and median will generally be found closer to the 
arithmetic than to the geometric average. When the 
dispersion follows the geometric law the mode and median 
will generally be found closer to the geometric than to 
the arithmetic average. 
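
Relationships 3, 4 and 5 may be illustrated numerically. The Python sketch below uses the standard `statistics` module (`fmean`, `geometric_mean` and `harmonic_mean`, available in Python 3.8 and later). The two-term series is that of rule 5; the four-term series 3, 6, 8, 9 is borrowed from the earlier discussion of the geometric mean, as an assumption of convenience, to show that rule 5 need not hold for longer series.

```python
import math
from statistics import fmean, geometric_mean, harmonic_mean


def three_means(values):
    """Return (arithmetic, geometric, harmonic) means of positive values."""
    return fmean(values), geometric_mean(values), harmonic_mean(values)


# Two terms: rules 3, 4 and 5 all hold.
am2, gm2, hm2 = three_means([2, 8])
rule5_two_terms = math.isclose(gm2, math.sqrt(am2 * hm2))

# Four terms: the ordering of rules 3 and 4 holds, but rule 5 does not.
am4, gm4, hm4 = three_means([3, 6, 8, 9])
rule5_four_terms = math.isclose(gm4, math.sqrt(am4 * hm4))

print(am2, gm2, hm2, rule5_two_terms)
print(am4, gm4, round(hm4, 4), rule5_four_terms)
```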


CHARACTERISTIC FEATURES OF THE CHIEF AVERAGES 


The arithmetic mean 

1. The value of the arithmetic mean is affected by every 
measure in the series. For certain purposes it is too 
much affected by extreme deviations from the average. 

2. The arithmetic mean is easily calculated, and is determinate 
in every case. 

3. The arithmetic mean is a computed average, and hence is 
capable of algebraic manipulation. 


The median 

1. The value of the median is not affected by the magnitude 
of extreme deviations from the average. 

2. The median may be located when the items in a series are 
not capable of quantitative measurement. 

3. The median may be located when the data are incomplete, 
provided that the number and general location of all the 
cases be known, and that accurate information be avail- 
able concerning the measures near the center of the 
distribution. 

4. The median is not so well adapted to algebraic manipu- 
lation as are the arithmetic, geometric and harmonic 
means. 


The mode 

1. The value of the mode is not affected by the magnitude of 
extreme deviations from the average. 

2. The approximate mode is easy to locate but the determi- 
nation of the true mode requires extended calculation. 

3. The mode has no significance unless the distribution in- 
clude a large number of measures and possess a distinct 
central tendency. 

4. The mode is the average most typical of the distribution, 
being located at the point of greatest concentration. 

5. The mode is not capable of algebraic manipulation. 




The geometric mean 

1. The geometric mean gives less weight to extreme devi- 
ations than does the arithmetic mean. 

2. It is strictly determinate in averaging positive values. 

3. The geometric mean is the form of average to be used 
when rates of change or ratios between measures are to 
be averaged, as equal weight is given to equal ratios of 
change. It is particularly well adapted to the averaging 
of ratios of price change. 

4. The geometric mean is capable of algebraic manipulation. 


The harmonic mean 

1. The harmonic mean is adapted to the averaging of time 
rates and certain similar terms. It has been employed 
in the field of economic statistics in the manipulation of 
price data. 

2. The labor of computing the harmonic mean and its un- 
familiarity detract from its usefulness in ordinary sta- 
tistical analysis. 

3. The harmonic mean is capable of algebraic manipulation. 


This summary has been designed to show that each 
type of average has its own particular field of usefulness. 
Each one is best for certain purposes and under certain 
conditions. The characteristics and limitations of each one 
should be understood in order that it may be appropriately 
employed. A complete description of a frequency distri- 
bution frequently calls for the determination of two or 
three of the chief averages, as well as other statistical 
measurements. The arithmetic mean is perhaps the most 
useful single average. The simplicity of its computation, 
the possibility of employing it in algebraic calculations and 
the fact that its meaning is perfectly definite and familiar 
make it highly serviceable in statistical work. Its sphere 
of usefulness is not universal, however, and it should only 
be employed when the given conditions render it suitable. 
A fuller appreciation of the distinctive virtues of the geo- 
metric mean is leading to a wider employment of that 
measure in many types of statistical work. A discriminat- 
ing use of averages is essential to sound statistical analysis. 


REFERENCES 


Bowley, A. L. Elements of Statistics (82-109, 138-139). 

Chaddock, R. E. Principles and Methods of Statistics (Chaps. VI, 
VII, VIII). 

Jones, D. C. A First Course in Statistics (22-41). 

Kelley, Truman L. Statistical Method (44-68). 

King, W. I. Elements of Statistical Method (121-140). 

Pearl, Raymond. Medical Biometry and Statistics (264-272). 

Rugg, H. O. Statistical Methods Applied to Education (97-148). 

Secrist, Horace. Introduction to Statistical Methods (234-292). 

Walsh, C. M. The Problem of Estimation (1-67). 

Yule, G. U. An Introduction to the Theory of Statistics (106-132). 

Zizek, Franz. Statistical Averages (82-109). 


CHAPTER V 


DESCRIPTION OF THE FREQUENCY DISTRIBUTION: 
MEASURES OF VARIATION AND SKEWNESS 


In the preceding chapters we have been concerned, first, 
with methods of reducing a mass of quantitative data to a 
form in which the characteristics of the mass as a whole 
may be readily determined and, in the second place, with 
methods of describing the assembled data. The first ob- 
ject is accomplished with the formation of a frequency 
distribution. The second is partially accomplished when 
there has been obtained a single significant value in the 
form of an average which represents the central tendency 
of the distribution. But any average, by itself, fails to 
give a complete description of a frequency distribution. 
Three other values are needed before the chief character- 
istics of a given distribution have been measured, and 
comparison with other distributions is possible. The first 
of these is a measure of the degree to which the items in- 
cluded in the original distribution depart or vary from the 
central value, the degree of “scatter,” variation or dispersion. 
The second is a measure of the degree of symmetry of the 
distribution, of the balance or lack of balance on the two 
sides of the central value. The third is a measure of 
kurtosis, of the degree to which there is a bunching of cases 
at the modal value. The present chapter deals with various 
measures of variation and skewness. The method of 
measuring kurtosis is referred to at a later point. 


NATURE AND SIGNIFICANCE OF VARIATION 


The fact of variation in collections of quantitative data 
has been pointed out in earlier sections and the bearing of 
this fact upon the work of the statistician indicated. Prac- 
tically every collection of quantitative data, consisting of 
measurements from the social, biological or economic field, 
is characterized by variation, by quantitative differences 
between the individual units. And this fact of variation is 
as important as the fact of family resemblance. Biological 
variation has been a fundamental factor in the evolutionary 
process. No measurement of a physical characteristic of a 
racial group, such as height, is complete without an ac- 
companying measure of the average variation in the group 
in this respect. The average income in a country is perhaps 
of less significance than the variation in income, the differ- 
ences between the incomes received by different social 
classes. Price variations interrupt the normal functioning 
of the economic system, causing hardship to some and 
giving unearned profits to others, because the various ele- 
ments in the price system are unequally affected. It is not 
the general change in the price level but the differences 
between price changes which cause trouble. 

An average, by itself, has little significance unless the 
degree of variation in the given frequency distribution is 
known. If the variation is so great that there is no pro- 
nounced central tendency an average has no significance. 
With a decrease in the degree of variation an average 
becomes increasingly significant. Whether a single fre- 
quency distribution is being described, therefore, or com- 
parison is being made with other distributions, a measure of 
central tendency must be supplemented by a measure of 
variation. 


MEASURES OF ABSOLUTE VARIATION 


Variation may be expressed in terms of the units of 
measurement employed for the original data, or may be 
expressed as an abstract figure, such as a percentage, 
which is independent of the original units. When the 
original units are employed absolute variability is measured; 
when an abstract figure is secured we have a measure of 
relative variability, more suitable for comparison than the 
former type. Measures of absolute variability are first 
considered. 


THE RANGE 


A rough measure of variation is afforded by the range, 
which is the absolute difference between the value of the 
smallest item and the value of the greatest item included 
in the distribution. Table 30 shows the distribution of 
London-New York monthly exchange rates during the 
period 1882-1913. The smallest item among the original 
figures included in the table is $4.83; the greatest is $4.908. 
The range, therefore, is $4.908-$4.83, or $.078. A distance 
on the scale equal to $.078 will include every item. If the 
original data were not to be had the range could be ap- 
proximated from the frequency table. It would be the 
difference between the lower limit of the class at the lower 
extreme of the distribution, and the upper limit of the 
class at the upper extreme, or $.085 in the present case. 

The value of the range, it is obvious, depends upon the 
values of the two extreme cases only. A single abnormal 
item would change its value materially. Because it is 
erratic and is likely to be unrepresentative of the true 
distribution of items, it is seldom used in statistical work. 
The range is frequently employed as a measure of stock 
market fluctuations, though its adequacy for this purpose 
may be questioned. 


THE MEAN DEVIATION 


A more accurate measure of the dispersion of items 
about a central value is afforded by the simple device of 
measuring the deviation of each item from this central 
value and averaging these deviations. The following 
simple example illustrates the method of computation: 




TABLE 29
Computation of Mean Deviation

 m     f     d
 3     1     6
 6     1     3
 9     1     0
12     1     3
15     1     6
       5    18

M = 9

\[ \text{M.D.} = \frac{18}{5} = 3.6 \]


The average (the mean and median coincide in this case) 
is 9. The deviations are totaled, taking no account of 
algebraic signs, and the total divided by the number of 
items. This procedure is described by the expression 
\[ \text{M.D.} = \frac{\sum d}{N} \]


In general terms, the mean deviation of a series of mag- 
nitudes is the arithmetic mean of their deviations from an 
average value (either mean or median). In the process of 
summation and averaging the algebraic signs of the devi- 
ations are disregarded. In practice it makes little differ- 
ence whether deviations be measured from the mean or 
the median. Theoretically the latter should be chosen, for 
the value of the mean deviation is least when the median 
is the base. 
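The computation of Table 29 may be sketched in Python as follows. The function name `mean_deviation` and the second, skewed series are assumptions introduced for illustration; the first series is that of Table 29.

```python
from statistics import mean, median

def mean_deviation(items, center):
    """Arithmetic mean of the deviations from `center`, signs disregarded."""
    return sum(abs(x - center) for x in items) / len(items)

# Items of Table 29: mean and median coincide at 9.
items = [3, 6, 9, 12, 15]
print(mean_deviation(items, mean(items)))   # 3.6

# Hypothetical skewed series: the mean deviation is least
# when the median is taken as the base.
skewed = [1, 2, 3, 4, 20]
md_median = mean_deviation(skewed, median(skewed))
md_mean = mean_deviation(skewed, mean(skewed))
print(md_median, md_mean)                   # 4.2 5.6
```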

Table 30 illustrates the computation of the mean devi- 
ation when the data are grouped in a frequency distri- 
bution. In the computation of the arithmetic mean of 
the items of a frequency distribution the assumption is 
made, it will be remembered, that all the cases in a given 
class are concentrated at the mid-point of that class; in 
other words, that the mid-value of the class may be taken 
as the value of each of the measures included. The same 
assumption is made in the computation of the mean 
deviation. 




TABLE 30 


Computation of Mean Deviation 
London-New York Exchange Rates, 1882-1913 
(The table is based upon rates prevailing at the beginning of each month) 


Class-interval     Mid-point     Frequency     Deviation from arbitrary origin
(The body of the table is not legible in this copy.)

Arbitrary origin = $4.8700
Median           =  4.8721
Difference       = $ .0021

196 cases too small by $.0021
188 cases too large by $.0021
Net result: 8 cases too small by $.0021

Sum of deviations from arbitrary origin = $5.020
Correction (8 times $.0021) = $.0168
Sum of deviations from median = $5.020 + .0168 = $5.0368

\[ \text{M.D.} = \frac{\$5.0368}{384} = \$.01312 \]


The mean deviation might be computed from this table 
by the method followed in Table 29. The deviation of the 
mid-point of each class from the median would be com- 
puted, multiplied by the number of cases in that class, and 
the total divided by the number of items. Thus the 
median of the exchange rates is $4.8721; the deviation of 
the item in the first class is $.0421, of each of the six items 
in the second class $.0371, etc. But the handling of these 




true deviations is laborious, since fractional quantities are 
involved. It is simpler to measure the deviations from 
some arbitrary value, sum these deviations from the assumed 
median (or mean), and correct this total by an amount 
equal to the error caused by the measurement of deviations 
from a value other than the true average. This is the 
method followed in Table 30. 

The same method would be employed in computing the 
mean deviation from the arithmetic mean. In either case 
the computations might be simplified still further by 
measuring deviations in class-interval units, as in some of 
the calculations in the preceding chapter. 

In Table 30 deviations have been measured not from 
$4.8721, the value of the median, but from $4.870, which 
is selected arbitrarily as the origin. The sum of the devi- 
ations from this arbitrary origin, regardless of signs, is 
$5.020. By how much does this differ from the sum of the 
deviations from the median? For each of the 45 cases in 
the class having as mid-point the arbitrary origin, $4.870, 
the value of d’ (the deviation from the arbitrary origin) is 
zero. They deviate from the median, however, by the 
difference between $4.870 and $4.8721, or $.0021. In this 
case the measured deviations are too small by $.0021. 
The same error is made in measuring the deviation of each 
item in all the lower classes; in all there are 196 cases in 
which the deviation from the arbitrary origin is less by 
$.0021 than the deviation from the median. But for all 
the observations falling in classes above that in which the
arbitrary origin lies, the error is in the opposite direction. 
Thus, for the 49 items in the class next above that in which 
the origin falls, the deviation from the arbitrary origin is 
$.005, while the deviation from the median is $.0029; the 
deviations as measured are too large by $.0021. In all, 
there are 188 cases in which the error is in this direction. 
The net result, therefore, is that 8 are too small by $.0021. 
The sum of the deviations, $5.020 is, therefore, too small 




by 8 × $.0021, or $.0168. The sum of the deviations from
the median is thus $5.0368, and the mean deviation is 
$.01312. 

This method may be summarized in the following 
formula: 
\[ \text{M.D.} = \frac{\sum (fd') + (N_s - N_l)\,c}{N} \]

in which

\(N_s\) = Number of cases for which deviations measured from
arbitrary origin are smaller than deviations measured
from median (or mean).

\(N_l\) = Number of cases for which deviations measured from
arbitrary origin are larger than deviations measured
from median (or mean).

\(c\) = Difference between arbitrary origin and the median (or
mean).


For the application of this formula it is necessary that 
the arbitrary origin and the median (or mean) be located 
within the same class-interval.¹

The following outline indicates the steps to be taken in 
the computation of the mean deviation: 


1. Determine the value of the median. 

2. Taking the mid-point of the class containing the median as 
arbitrary origin, measure the deviations from this origin 
of the items in each class. Multiply these deviations by 
the class-frequencies and secure the total of the products 
without regard to sign. 

3. Count the number of cases in which the deviations from the
arbitrary origin are greater than the deviations from the
median, and the number of cases in which they are less than
the deviations from the median. Get the difference between
these two figures; multiply by this quantity the
difference between the median and the arbitrary origin.
This is the amount by which the sum of the deviations
from the arbitrary origin differs from the sum of the
deviations from the median. Apply the above correction
to the sum of the deviations from the arbitrary
origin.

4. Divide the corrected sum of the deviations by the number
of items to get the mean deviation from the median.

(The mean deviation from the mean may be computed by an
identical process.)

1 The above calculations proceed on the assumption that all the cases in the
class containing the median may be treated as though they were concentrated
at the mid-point of that class. A slight gain in accuracy might be made by
assuming a uniform distribution throughout this class-interval and modifying
the method accordingly. (Cf. Handbook of Mathematical Statistics, edited by
H. L. Rietz, 29-31, for an exposition of this method.)
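The steps outlined above may be sketched in Python. The function name and the small frequency table used for the check are hypothetical; only the totals in the second part ($5.020, 196, 188, $.0021 and 384) are those of Table 30. The sketch assumes, as the text requires, that the arbitrary origin is a class mid-point lying in the same class-interval as the median.

```python
def mean_deviation_shortcut(midpoints, freqs, origin, center):
    """Mean deviation about `center`, from deviations about `origin`.

    Assumes `origin` is a class mid-point and that `center` (the median
    or the mean) falls within the same class-interval.
    """
    n = sum(freqs)
    sum_fd = sum(f * abs(m - origin) for m, f in zip(midpoints, freqs))
    c = center - origin
    if c >= 0:
        # Origin class and all lower classes are measured too small by c.
        n_small = sum(f for m, f in zip(midpoints, freqs) if m <= origin)
    else:
        # Origin class and all higher classes are measured too small by |c|.
        n_small = sum(f for m, f in zip(midpoints, freqs) if m >= origin)
    n_large = n - n_small
    return (sum_fd + (n_small - n_large) * abs(c)) / n

# Hypothetical frequency table (not from the text), median taken as 28.75.
midpoints = [10, 20, 30, 40, 50]
freqs = [2, 5, 8, 4, 1]
shortcut = mean_deviation_shortcut(midpoints, freqs, origin=30, center=28.75)
direct = sum(f * abs(m - 28.75) for m, f in zip(midpoints, freqs)) / sum(freqs)
print(shortcut, direct)          # 7.875 7.875

# Totals reported for Table 30, inserted in the same formula.
book_md = (5.020 + (196 - 188) * 0.0021) / 384
print(round(book_md, 5))         # 0.01312
```

The agreement of the two figures in the first check shows that the correction term reproduces the sum of the true deviations without the fractional quantities being handled class by class.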


THE STANDARD DEVIATION


The process of calculating the mean deviation is alge- 
braically illogical because algebraic signs are disregarded. 
In the computation of the standard deviation this error is 
avoided and a measure of more precise mathematical sig- 
nificance is secured. The conventional symbol for the 
standard deviation is the Greek letter sigma, σ.

In computing this measure the deviations of the indi- 
vidual items from the arithmetic mean are squared, totaled, 
the mean of the squared deviations obtained, and the 
square root of this mean extracted. The standard deviation
is, thus, the square root of the mean of the squared
deviations. This measure is also termed the root-mean- 
square deviation, a useful name because it describes in full 
the method of calculation. The deviations are always 
measured from the arithmetic mean, as the value of the 
measure is a minimum under these conditions. A simple 
example will illustrate the process. 


TABLE 31
Computation of Standard Deviation

 m     f     d     d²
 3     1    −6     36
 6     1    −3      9
 9     1     0      0
12     1    +3      9
15     1    +6     36
       5           90

M = 9

\[ \sigma = \sqrt{\frac{90}{5}} = \sqrt{18} \]




When the standard deviation is computed from un- 
grouped data, as in this case, the formula is 


\[ \sigma = \sqrt{\frac{\sum d^2}{N}} \]
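A minimal Python sketch of this computation, applied to the five items of Table 31, is given here. The function names are illustrative assumptions; `statistics.pstdev` from the standard library is used only as an independent check.

```python
from math import sqrt
from statistics import pstdev

def rms_deviation(items, origin):
    """Root-mean-square deviation of the items from `origin`."""
    return sqrt(sum((x - origin) ** 2 for x in items) / len(items))

def standard_deviation(items):
    """Root-mean-square deviation from the arithmetic mean."""
    return rms_deviation(items, sum(items) / len(items))

items = [3, 6, 9, 12, 15]          # Table 31
sigma = standard_deviation(items)
print(round(sigma, 4))             # 4.2426, the square root of 18
print(round(pstdev(items), 4))     # 4.2426

# Measured from any other origin (10 is an arbitrary choice), the
# root-mean-square deviation is greater than sigma.
print(round(rms_deviation(items, 10), 4))   # 4.3589, the square root of 19
```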


When the items are grouped in a frequency distribution
the task of computation is a little more complicated. The
measurement of deviations from an arbitrary origin is
essential in this case, as it greatly simplifies the calculations.
The general formula for the standard deviation is

\[ \sigma = \sqrt{\frac{\sum f d^2}{N}} \]

where f represents the class-frequencies, d the deviations
from the arithmetic mean and N the number of cases
included. It follows that

\[ \sigma^2 = \frac{\sum f d^2}{N}. \]

If a deviation from an arbitrary origin be represented by
d' and the root-mean-square deviation from this origin be
represented by s, we have

\[ s^2 = \frac{\sum f (d')^2}{N}. \]

The root-mean-square deviation from the mean (σ) is less
than the root-mean-square deviation from any other point
on the scale. Hence s² is greater than σ². We may represent
by c the difference between the true mean and the
arbitrary origin. It may be readily established¹ that

\[ \sigma^2 = s^2 - c^2. \]

1
\[ \sigma^2 = \frac{\sum d^2}{N}, \quad \text{or} \quad N\sigma^2 = \sum d^2 \]
\[ s^2 = \frac{\sum (d')^2}{N}, \quad \text{or} \quad Ns^2 = \sum (d')^2 \]
\[ d' = d + c \]
\[ (d')^2 = d^2 + 2cd + c^2 \]
\[ \sum (d')^2 = \sum d^2 + 2c\sum d + Nc^2 \]
but \( \sum d = 0 \), so that
\[ \sum (d')^2 = \sum d^2 + Nc^2 \]
\[ \frac{\sum (d')^2}{N} = \frac{\sum d^2}{N} + c^2 \]
\[ s^2 = \sigma^2 + c^2 \]

Cf. Yule, Introduction to the Theory of Statistics, 134.




The value of the standard deviation may be most easily
determined, therefore, by computing s² and c². The operations
involved are illustrated in detail in Table 32, showing
the distribution of London-Paris exchange rates over a
period of 384 months.


TABLE 32
Computation of Standard Deviation 
London-Paris Exchange Rates, 1882-1913 


(The table is based upon rates prevailing at the beginning of each month, 
as quoted in the Economist) 


(1)              (2)        (3)       (4)         (5)      (6)       (7)        (8)
Class-interval   Mid-       Fre-      Deviation
(francs)         point      quency    from
                 (francs)             arbitrary
                                      origin
                 m          f         d'          fd'      f(d')²    (d'+1)²    f(d'+1)²
25.07-25.089     25.08        1       -8           -8        64        49         49
25.09-25.109     25.10        4       -7          -28       196        36        144
25.11-25.129     25.12       14       -6          -84       504        25        350
25.13-25.149     25.14       20       -5         -100       500        16        320
25.15-25.169     25.16       45       -4         -180       720         9        405
25.17-25.189     25.18       60       -3         -180       540         4        240
25.19-25.209     25.20       40       -2          -80       160         1         40
25.21-25.229     25.22       43       -1          -43        43         0          0
25.23-25.249     25.24       42        0            0         0         1         42
25.25-25.269     25.26       32        1           32        32         4        128
25.27-25.289     25.28       26        2           52       104         9        234
25.29-25.309     25.30       21        3           63       189        16        336
25.31-25.329     25.32       20        4           80       320        25        500
25.33-25.349     25.34        4        5           20       100        36        144
25.35-25.369     25.36        6        6           36       216        49        294
25.37-25.389     25.38        2        7           14        98        64        128
25.39-25.409     25.40        2        8           16       128        81        162
25.41-25.429     25.42        2        9           18       162       100        200
                            384                  -372     4,076                3,716

N = 384
class-interval = .02 francs
c (in class-interval units) = −372/384 = −.969
c² (in class-interval units) = .9390
σ² (in class-interval units) = s² − c² = 10.6146 − .9390 = 9.6756
σ (in class-interval units) = 3.11
σ (in original units) = 3.11 × .02 = .0622

The entire calculation, it will be noted, is carried through 
in terms of class-interval units, the result being reduced to 
the original units in the final operation. In computing c, 
the difference between the true mean and the arbitrary 
origin, the algebraic sum of the deviations is divided by the 
number of cases. The arithmetic mean could be deter- 
mined by reducing c to original units and adding this value 
(algebraically) to the value of the arbitrary quantity 
selected as origin, but this is not an essential step. The 
actual value of the mean need not be known in the com- 
putation of the standard deviation. 

A check upon the accuracy of the calculations (the
Charlier check¹) is afforded by the figures in cols. (7) and
(8) of Table 32. If deviations be measured, not from the 
arbitrary origin employed in computing the standard devi- 
ation, but from an origin one class-interval below, we 
secure a set of values equal to d’ +1. The squares of 
these values are given in col. (7). Multiplying by the 
corresponding frequencies we have the quantities recorded 
in col. (8), the sum of which is 3,716. This total stands in 
a definite relationship to the values secured in computing 
the standard deviation. For 

\[ \sum f(d' + 1)^2 = \sum f\,[(d')^2 + 2d' + 1] \]
\[ = \sum f(d')^2 + 2\sum fd' + \sum f \]
or
\[ \sum f(d' + 1)^2 = \sum f(d')^2 + 2\sum fd' + N \]

Inserting in this last equation the values secured from 

the calculations shown in Table 32, we obtain this check: 
3,716 = 4,076 + 2(−372) + 384
= 3,716


1 Cf. Vorlesungen über die Grundzüge der mathematischen Statistik, C. V. L.
Charlier, 19.




The following is a summary of the steps in the process of 
computing the standard deviation of items grouped in a 
frequency distribution: 


1. Select as arbitrary origin the mid-point of a class near the 
center of the distribution. 

2. Measure the deviations from this point of the items in 
each class, in class-interval units. Multiply the devia- 
tions by the corresponding class-frequencies. 

3. Divide the algebraic sum of the deviations by N. This
gives c, in class-interval units. Compute c².

4. Square the deviations and multiply by the corresponding
class frequencies.

5. Divide the sum of the squared deviations by N. This
gives s², in class-interval units.

6. From the formula, σ² = s² − c², compute σ². Extract the
square root of this value, securing σ in class-interval units.

7. Multiply σ, as thus computed, by the class-interval. The
result is σ in the original units of measurement.
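The seven steps, together with the Charlier check, may be sketched in Python with the frequencies of Table 32. The variable names are assumptions introduced here, and two frequencies that are blurred in the printed table (21 and 2) are read from the column products and the total of 384. The sketch carries full precision, so that intermediate figures differ slightly in the later decimals from the rounded values of the text.

```python
from math import sqrt

# Table 32: class frequencies and deviations d' in class-interval units,
# the arbitrary origin being the mid-point 25.24 francs.
freqs = [1, 4, 14, 20, 45, 60, 40, 43, 42, 32, 26, 21, 20, 4, 6, 2, 2, 2]
devs = list(range(-8, 10))          # -8, -7, ..., 9
width = 0.02                        # class-interval in francs

n = sum(freqs)                                            # 384
sum_fd = sum(f * d for f, d in zip(freqs, devs))          # -372
sum_fd2 = sum(f * d * d for f, d in zip(freqs, devs))     # 4076

c = sum_fd / n                      # step 3, in class-interval units
s2 = sum_fd2 / n                    # step 5
sigma_units = sqrt(s2 - c * c)      # step 6
sigma_francs = sigma_units * width  # step 7

print(round(sigma_units, 2), round(sigma_francs, 4))      # 3.11 0.0622

# Charlier check: deviations measured from an origin one class lower.
charlier = sum(f * (d + 1) ** 2 for f, d in zip(freqs, devs))
print(charlier, sum_fd2 + 2 * sum_fd + n)                 # 3716 3716
```

The mean itself is not required; it could be recovered as 25.24 + c × .02 if wanted.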


Certain of the characteristics of the standard deviation 
and its relation to other measures of dispersion are described 
in a later section.¹


THE QUARTILE DEVIATION


In the chapter on averages methods of locating the 
quartiles and deciles were described. The former are those 
points on the scale of values, along which the items of a 
given distribution lie, which divide the total number of 
items into four equal groups. The deciles are those points 
dividing the total number of items into ten equal groups. 
The degree and character of the variation in a frequency 
distribution may be accurately described if the location of 
the quartiles and deciles is shown. Such knowledge, how- 
ever, while helpful in giving a picture of the distribution, 
is not as useful for purposes of concise description and 


1 A correction to be applied to the standard deviation in certain cases 
(Sheppard’s correction) is described in Chapter XV. 




comparison as knowledge of the values of the mean devia- 
tion or the standard deviation. The significance of a single 
measure is more readily grasped than is the meaning of a 
number of inter-related values. Such a measure of varia- 
tion may be computed from the quartiles, however. With 
regard to ease of calculation and immediate significance 
this quartile deviation has distinct merits. 

Within the range between the two quartiles, of course, 
one half of all the measures are included. The greater the 
concentration the smaller this interval, hence a fairly accu- 
rate measure of dispersion may be obtained from the 
relationship between these two quartiles. The quartile 
deviation is the semi-interquartile range, half the distance 
along the scale between the first and third quartiles. Thus 
if Q.D. represent the quartile deviation, Q₁ the first quartile
and Q₃ the third quartile,

\[ \text{Q.D.} = \frac{Q_3 - Q_1}{2}. \]


If the value of a point on the scale half-way between
the first and third quartiles is represented by K, one half
of all the measures in a frequency distribution will fall
within the range K ± Q.D. For the preceding data, on
Paris exchange rates, we have


\[ Q_3 = 25.262 \]
\[ Q_1 = 25.174 \]
\[ \text{Q.D.} = \frac{25.262 - 25.174}{2} = .044 \]
\[ K = 25.174 + .044 = 25.218. \]


Thus one half of all the measures lie within the range
25.218 ± .044. This statement, together with a statement
of the average value of exchange rates (mean, median or 
mode), constitutes a useful description of the distribution. 
In a perfectly symmetrical distribution the value of K will 




coincide with the value of the median (that is, the median 
will lie half-way along the scale from Q₁ to Q₃). The distribution
of exchange rates is asymmetrical, the value of
the median being 25.214, as compared with the value of 
25.218 for K. 
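The arithmetic of this section may be sketched in Python. The function names are illustrative assumptions; the quartiles and the median are the figures quoted above for the London-Paris rates.

```python
def quartile_deviation(q1, q3):
    """Semi-interquartile range."""
    return (q3 - q1) / 2

def midquartile(q1, q3):
    """The point K half-way between the first and third quartiles."""
    return (q1 + q3) / 2

q1, q3, median = 25.174, 25.262, 25.214
qd = quartile_deviation(q1, q3)
k = midquartile(q1, q3)
print(round(qd, 3), round(k, 3))    # 0.044 25.218

# K exceeds the median slightly, a sign of asymmetry.
print(round(k - median, 3))         # 0.004
```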


THE PROBABLE ERROR


In studying the results of astronomical and other physi- 
cal measurements it has been found that the values secured 
by different observers for the same constant quantity vary. 
These varying results, however, are distributed in a certain 
definite way, and when plotted give a curve similar to the 
normal curve of error. In such cases there is an immediate 
and obvious need of some measure of variation which may 
be used as an index of the reliability of given results. If 
the results secured by different investigators, or by the 
same investigator at different times, vary widely they can 
not be accepted as reliable, while the reverse is true if the 
variation is slight. The measure of dispersion which has
been generally employed in such cases is termed the prob- 
able error. The probable error is that amount which, in a 
given case, is exceeded by the errors of one half the ob- 
servations. Since the most probable value of a given series 
of observations is their arithmetic mean, the probable 
error is always measured from the mean. The name of 
this measure derives from the fact that the probability 
that a given observation will vary from the mean of all 
the observations by an amount greater than the probable 
error is exactly ½. It follows that, when the observations
are arranged in the form of a frequency distribution, an 
amount equal to the probable error laid off on each side of 
the arithmetic mean will include one half of the total 
number of cases. 

This measure of variation has been employed in fields 
other than that in which it was originally applied, fields in 
which the original name of probable error is somewhat 




misleading. In such cases it is perhaps better to think of 
it as the probable deviation, that distance from the mean 
which will be exceeded by one half of the total deviations. 

The probable error is a measure of dispersion which is 
fully significant only when it applies to a distribution fol- 
lowing the normal law of error. In such cases it has a 
definite and precise meaning. This is not so when it is 
applied to skewed distributions, and its use in such cases 
is not advisable. The quartile deviation, the value of 
which is equal to that of the probable error in a normal 
distribution, has a more direct significance than the prob- 
able error in the description of abnormal distributions, and 
should be employed in such cases. In a later section the 
use of the probable error as a measure of the reliability of 
statistical results is more fully explained. 

The value of the probable error in a given case, assum- 
ing a normal distribution to prevail, may be determined 
from the value of the standard deviation, for there is a 
constant relationship between these two. This is expressed 
by the formula: P.E. = 0.6745 σ.
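The origin of the constant 0.6745 may be sketched with Python's standard `statistics.NormalDist`. The function name `probable_error` is an assumption, and the value σ = .0622 is that of Table 32, used only to illustrate the arithmetic, since that distribution is not strictly normal.

```python
from statistics import NormalDist

def probable_error(sigma):
    """Probable error of a normal distribution with standard deviation sigma."""
    return 0.6745 * sigma

# The constant is the deviation, in units of sigma, that cuts off the
# upper quarter of a normal distribution.
ratio = NormalDist().inv_cdf(0.75)
print(round(ratio, 4))                        # 0.6745

# One half of the cases fall within one probable error of the mean.
unit = NormalDist()
print(round(unit.cdf(ratio) - unit.cdf(-ratio), 4))   # 0.5

print(round(probable_error(0.0622), 4))       # 0.042
```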


RELATIONS BETWEEN DIFFERENT MEASURES OF 
VARIATION 


An understanding of the significance of the various 
measures of dispersion described above may be facilitated 
by a general comparison and a summary statement of the 
relations holding between them. 

1. The range is a distance along the scale within which all the 

observations lie. 

2. The quartile deviation or semi-interquartile range is a distance 
along the scale which, when laid off on each side of the 
point midway between the two quartiles, includes one 
half of the total number of observations. 

3. The mean deviation from the mean, in a symmetrical or
slightly skewed distribution, is equal to about 4/5 of the
standard deviation. A range of 7½ times the mean
deviation, centering at the mean, will include approximately
99% of all the cases.

4. When a distance equal to the standard deviation is laid off 
on each side of the mean, in a normal or only slightly 
skewed distribution, about two thirds of all the cases will be 
included. (In the normal distribution exactly 68.26% of 
the observations will be included.) When a distance 
equal to twice the standard deviation is laid off on each 
side of the mean approximately 95% of the cases will be 
included (exactly 95.46% in a normal distribution). 
When a distance equal to three times the standard 
deviation is laid off on each side of the mean about 99% 
of all the observations will be included (exactly 99.73% 
in a normal distribution). This general rule that a range 
of six times the standard deviation, centering at the 
mean, will include about 99% of all the measures fur- 
nishes a useful check upon calculations. 

A study of Fig. 43 may help to make clear the significance of 
the standard deviation in a normal distribution. 

5. The probable error, in a normal distribution, is equal to
0.6745σ. A range of twice the probable error, centering
at the mean, will include 50% of all the observations.
A range of eight times the probable error, centering at 
the mean, will include approximately 99% of all the 
observations. 
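The relations summarized above may be checked for the normal distribution with a short Python sketch. The function name `share_within` is an assumption; the proportions are computed from the error function, and they agree with the percentages quoted in the text to within about one hundredth of one per cent.

```python
from math import erf, pi, sqrt

def share_within(k):
    """Share of a normal distribution lying within k standard
    deviations on each side of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 2))

# Mean deviation of a normal distribution, in units of sigma: about 4/5.
md_over_sigma = sqrt(2 / pi)
print(round(md_over_sigma, 4))                 # 0.7979

# A range of 7 1/2 mean deviations (3.75 on each side of the mean) and a
# range of eight probable errors (4 on each side) both include over 99%.
share_md = share_within(3.75 * md_over_sigma)
share_pe = share_within(4 * 0.6745)
print(share_md > 0.99, share_pe > 0.99)        # True True
```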


CHARACTERISTIC FEATURES OF THE CHIEF MEASURES 
OF VARIATION 
The Range 


1. The range is easily calculated and its significance is readily
understood. As a rough measure of the degree of vari- 
ation the range is useful. 

2. The value of the range is determined by the values of the 
two extreme cases. It is thus a highly unstable measure, 
the value of which may be greatly changed by the addi- 
tion or withdrawal of a single figure. 

3. This measure gives no indication of the character of the 
distribution within the two extreme observations. 


The quartile deviation

1. The quartile deviation is a measure of dispersion which is
easily computed and readily understood. It is superior
to the range as a rough measure of variation.

2. The quartile deviation is not a measure of the variation
from any specific average.

3. This measure is not affected by the distribution of the
items within the first or third quartiles, or by the distribution
outside the quartiles. The values of the quartile
deviation might be the same for two quite dissimilar
distributions, provided the quartiles happened to coincide.
Because it is not affected by the deviations of
individual items it cannot be accepted as an accurate
measure of variation.

4. The quartile deviation is not capable of algebraic treatment.

The mean deviation

1. The mean deviation is affected by the value of every
observation. As the average difference between the individual
items and the median (or mean) of the distribution
it has a precise significance.

2. The mean deviation is less affected by extreme deviations
than the standard deviation.

3. Mathematically, the mean deviation is not as logical or as
convenient a measure of dispersion as the standard
deviation.

The standard deviation

1. The standard deviation is affected by the value of every
observation.

2. The process of squaring the deviations before adding avoids
the algebraic fallacy of disregarding signs.

3. The standard deviation has a definite mathematical meaning
and is perfectly adapted to algebraic treatment.

4. The standard deviation is, in general, less affected by
fluctuations of sampling than the other measures of
dispersion.

5. The normal curve of error has been analyzed in terms of
the standard deviation. The information thus obtained
has increased greatly the utility of the standard deviation.


The probable error 


1. The probable error has a definite meaning in the case of a 
distribution following the normal law. It has not this 
precise meaning for other distributions, and should not 
be employed in describing them. 

2. For distributions to which it is adapted, the probable 
error is an extremely useful measure. Its most im- 
portant use is as an index of the reliability of quantitative 
measurements. 

3. The definite relationship between the probable error and 
the standard deviation, for a normal distribution, permits 
the value of the probable error to be readily determined. 


All the measures of variation described above may be 
utilized for particular purposes. The standard deviation, 
however, is the best general measure and should be em- 
ployed in all cases where a high degree of accuracy is re- 
quired. The probable error is, in effect, merely a fractional 
part of the standard deviation, with a definite but re- 
stricted field of usefulness. 


THE MEASUREMENT OF RELATIVE VARIATION


We have been dealing in the preceding section with 
absolute variability. The various measures of dispersion 
secured by the methods outlined describe the variability
of the data in terms of absolute units of measurement.
The standard deviation of London-Paris exchange
rates is in francs, the standard deviation of pig iron production
in tons, etc. If the object in a given case is the
description of a single frequency distribution it is desirable 
that the original unit be employed throughout, but if 
measures of variation of two different distributions are to 
be compared, difficulties are encountered. This is clear 
if the units are unlike, but even if the units are identical 
the same difficulty arises. Thus measures of variation in 
the weights of dogs and in the weights of horses might 




have been computed, both in pounds. Because the 
standard deviation of horse weights is greater than the 
standard deviation of dog weights, it does not follow that 
the degree of variability is greater in the former case. A 
measure of absolute variation is significant only in relation 
to the average from which the deviations are measured. 
Its use, apart from this average, is meaningless. For com- 
parison, therefore, it must be reduced to a relative form, 
and the obvious procedure is to express a given measure 
of variation as a percentage of the average from which 
the deviations have been measured. The quantity thus 
becomes an abstract number, a measure of the relative 
variability of the given observations, and may be compared 
with similar terms computed from other distributions. 


THE COEFFICIENT OF VARIATION


The measure of relative variation most commonly em- 
ployed is that developed by Pearson, termed the coefficient 
of variation, and represented by the letter V. It is simply
the standard deviation as a percentage of the arithmetic
mean. Thus

\[ V = \frac{\sigma}{M} \times 100. \]

Applying this formula to the results secured from the
analysis of London-Paris exchange rates, we have

\[ V = \frac{.0622}{25.2206} \times 100 = .25\%. \]


The variability of London-New York exchange rates dur- 
ing the same period may be computed from the data of 
Table 30. The coefficient of variation for this series has 
a value of .33%. Fluctuations in the London-New York 
rate were significantly more pronounced than in the London- 
Paris rate during this period. 
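The coefficient may be sketched in Python as follows. The function name is an assumption, and the weights of horses and dogs are hypothetical figures chosen only to illustrate the comparison discussed earlier; the London-Paris values are those of the text.

```python
def coefficient_of_variation(sigma, mean):
    """Standard deviation expressed as a percentage of the arithmetic mean."""
    return sigma / mean * 100

# London-Paris exchange rates (figures from the text).
v_paris = coefficient_of_variation(0.0622, 25.2206)
print(round(v_paris, 2))            # 0.25

# Hypothetical weights in pounds: the horses show the greater absolute
# variation, the dogs the greater relative variation.
v_horses = coefficient_of_variation(sigma=120, mean=1200)
v_dogs = coefficient_of_variation(sigma=8, mean=40)
print(v_horses, v_dogs)             # 10.0 20.0
```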






An index of variability similar to this coefficient might 
be secured by expressing any of the other measures of 
deviation as a percentage of the average from which the 
deviations were computed. Pearson’s coefficient has been 
generally adopted, however, and is the only one in wide use. 


MEASURES OF SKEWNESS 


Methods have been developed in the preceding sections 
for describing the central tendency of a frequency distri- 
bution and for measuring the degree of concentration or 
lack of concentration about that central tendency. One 
further measure is needed, and that is one which indicates 
the degree of skewness or asymmetry of a given distri- 
bution. For it is essential to know, in regard to a given 
distribution, whether the observations are arranged sym- 
metrically about the central value, or are dispersed in an 
uneven, asymmetrical fashion about that value. Having 
such a figure it will be possible effectively to summarize
the characteristics of a frequency distribution in three 
simple terms — an average, a measure of dispersion and a 
measure of skewness. Several such measures have been 
evolved. 

If a frequency curve is perfectly symmetrical, mean, 
median and mode will coincide. As the distribution de- 
parts from symmetry these three values are pulled apart, 
the difference between the mean and the mode being 
greatest. This difference may be used, therefore, as a 
measure of skewness. It is desirable in this case, as in
measuring relative variability, to secure an index in the 
form of an abstract number, which may be compared with 
similar figures derived from other distributions. To this 
end, Pearson has proposed dividing the absolute difference 
between mean and mode by the standard deviation of the 
given distribution. His formula is 
\[ sk \text{ (skewness)} = \frac{M - Mo}{\sigma}. \]




In a symmetrical distribution, where mean and mode coin- 
cide, the value of this measure will obviously be zero. The 
value may be positive or negative, depending upon the 
relative positions of the two averages on the scale. 
For moderately skewed distributions the degree of skew- 
ness may be computed more readily from the formula 
\[ sk = \frac{3(M - Md)}{\sigma}. \]


This corresponds approximately to the other formula, be- 
cause of the fact that in a moderately asymmetrical distri- 
bution the median lies between the mean and the mode, 
about one third of the distance from the former towards 
the latter. 

Because it is difficult to locate the mode by simple 
methods, a measure of skewness more easily computed than 
Pearson’s is desirable in some cases. Bowley has proposed 
such a method, based upon the relationship between the 
first and third quartiles and the median. If the distribution 
is symmetrical these two quartiles will be equidistant from 
the median; with an asymmetrical distribution this is not 
so. Therefore, if we let q₂ represent the difference between
the upper quartile and the median and q₁ represent the
difference between the median and the lower quartile, we
may use the formula

\[ sk = \frac{q_2 - q_1}{q_2 + q_1} \]

as a means of securing a measure of skewness. This value
will vary between 0 and ± 1. For with perfect symmetry
q₂ = q₁, and the measure is 0; with asymmetry so pronounced
that the median and one of the quartiles coincide,
either q₂ or q₁ becomes equal to zero, and the formula gives
a value of + 1 or − 1. Bowley suggests that a value of .1
indicates a moderate degree of skewness, while a value of
.3 indicates marked skewness.

The values secured from this measure are not, of course, 




comparable with the values secured from the application 
of Pearson’s formula for measuring skewness. 
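The three formulas may be sketched in Python. The function names are assumptions, and the mean, mode and standard deviation in the first call are hypothetical. The remaining figures are those quoted earlier for the London-Paris rates, and the resulting values are arithmetic illustrations rather than results stated in the text.

```python
def pearson_skewness(mean, mode, sigma):
    """Pearson's measure: (mean - mode) / standard deviation."""
    return (mean - mode) / sigma

def pearson_skewness_from_median(mean, median, sigma):
    """Approximation for moderately skewed distributions."""
    return 3 * (mean - median) / sigma

def bowley_skewness(q1, median, q3):
    """Bowley's measure from the quartiles and the median."""
    upper = q3 - median      # q2 in the text
    lower = median - q1      # q1 in the text
    return (upper - lower) / (upper + lower)

# Hypothetical distribution: mean 10, mode 8.5, standard deviation 3.
print(pearson_skewness(10, 8.5, 3))                       # 0.5

# London-Paris rates, from the figures quoted in this chapter.
sk_median = pearson_skewness_from_median(25.2206, 25.214, 0.0622)
sk_bowley = bowley_skewness(25.174, 25.214, 25.262)
print(round(sk_median, 2), round(sk_bowley, 2))           # 0.32 0.09
```

As the text observes, the two figures for the same distribution are not comparable with each other, the measures being built on different scales.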


KURTOSIS


Reference has been made to a fourth characteristic of a 
frequency curve which may be measured. This is the degree 
of flat-toppedness, as compared with the normal curve. 
A measure of kurtosis, the technical term for this charac- 
teristic, is given in Chapter XV. 


REFERENCES 


Bowley, A. L. Elements of Statistics (110-117).

Chaddock, R. E. Principles and Methods of Statistics (Chap. IX).

Davies, G. R. Introduction to Economic Statistics (29-46).

Jones, D. C. A First Course in Statistics (42-51).

Kelley, Truman L. Statistical Method (70-82).

King, W. I. Elements of Statistical Method (141-158).

Pearl, Raymond. Medical Biometry and Statistics (272-278).

Rietz, H. L. (editor). Handbook of Mathematical Statistics (27-33).

Rugg, H. O. Statistical Methods Applied to Education (149-179).

Secrist, Horace. An Introduction to Statistical Methods (377-424).

West, Carl J. Introduction to Mathematical Statistics (45-58).

Yule, G. U. Introduction to the Theory of Statistics (133-156).


CHAPTER VI 
INDEX NUMBERS OF PRICES 


THE NATURE OF INDEX NUMBERS


The term “index number” has been applied to a number 
of somewhat similar devices employed in the analysis of 
statistical series. Index numbers have been most widely 
used in the study of price changes, but a brief considera- 
tion of certain other uses may make clear the essential 
characteristics of such measures. In its simplest form this 
name is applied to a term in a time series expressed as a 
relative. Thus an index number of cotton consumption in 
the United States might take the following form: 


TABLE 33 
Domestic Cotton Consumption in the United States, 1913-1923 
(Consumption in 1913 = 100) 


Year    Cotton Consumption        Cotton Consumption
        (Unit: Running Bale)      Relative
1913    5,583,468                 100.0
1914    5,448,760                  97.6
1915    6,008,984                 107.5
1916    6,620,415                 118.6
1917    6,815,811                 122.0
1918    6,176,547                 110.5
1919    5,919,520                 106.0
1920    5,843,200                 104.5
1921    5,406,721                  96.8
1922    6,087,520                 109.0
1923    6,513,696                 116.7


Similarly the price of a commodity may be expressed as
a relative, the price at a given date or for a given period
serving as base.


TABLE 34
Average Price of No. 1 Northern Spring Wheat, Minneapolis, 1913, 1919-1923
(Average Price in 1913 = 100)

Year    Average Price per Bushel    Relative Price
1913    $0.8735                     100
1919     2.5660                     294
1920     2.5581                     293
1921     1.4660                     168
1922     1.3450                     154
1923     1.1810                     135


The representation of the terms in a time series as rela- 
tives, with reference to a fixed base, permits a ready com- 
parison of the values for different dates and permits the 
trend of the series to be perceived much more easily 
than when the data are presented in their original form. 
Comparison of the trends of different series is also facili- 
tated. 
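The reduction of a series to relatives on a fixed base may be sketched as follows, using the wheat prices of Table 34. The function name `relatives` is an illustrative choice, and rounding to the nearest whole number is assumed because that is how the table presents its figures.

```python
def relatives(series, base_year):
    """Express each term of a time series as a percentage of the base-year term."""
    base = series[base_year]
    return {year: round(100 * value / base) for year, value in series.items()}

# Average price per bushel, No. 1 Northern Spring Wheat, Minneapolis (Table 34).
wheat = {1913: 0.8735, 1919: 2.5660, 1920: 2.5581,
         1921: 1.4660, 1922: 1.3450, 1923: 1.1810}

wheat_relatives = relatives(wheat, 1913)
```

The result reproduces the Relative Price column of Table 34.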

Though the term index number has been applied to such 
relatives it is better practice to reserve the term for figures 
which represent the combination of a number of series. 
The series to be combined may relate to prices, production, 
consumption, wages, volume of trade or to any factor 
subject to temporal variation. Quite complex problems 
may be involved in the construction of any one of these 
special forms of index numbers, but the essential aim in all 
cases is to secure a single, simple figure which will indicate 
the net resultant of the forces acting on the constituent
series.

A simple index number may be constructed to represent
the course of coal and petroleum production in the United
States. In the making of such an index it is necessary to
combine in some way production figures for bituminous 
and anthracite coal and petroleum. The following are the 
production figures and the corresponding relatives for the 
three series, from 1910 to 1923: 


TABLE 35
Production of Bituminous and Anthracite Coal and Petroleum in the
United States, 1910-1923
(Production in 1910 = 100)

        Prod. of Bit. Coal          Prod. of Anthr. Coal        Prod. of Petrol.
Year    Million sh. tons    Rel.    Million sh. tons    Rel.    Mill. bbls.    Rel.
1910    417                 100     84.49               100     …              100
1911    405                  97     90.46               107     …              …
1912    450                 108     84.36               100     …              …
1913    478                 115     91.52               108     …              …
1914    422                 101     90.82               107     …              …
1915    442                 106     89.00               105     …              …
1916    502                 120     87.58               103     …              …
1917    551                 132     99.61               118     …              …
1918    579                 139     98.83               117     …              …
1919    465                 112     88.09               104     …              …
1920    568                 136     89.60               106     …              …
1921    415                 100     90.44               107     …              …
1922    404                  97     52.90                63     557            …
1923    545                 131     95.2                113     725            346

[The decimal digits of the bituminous figures and most of the petroleum column are illegible in the source; entries that cannot be read are shown as ….]

A rough index of fuel production, based upon these three 
series, is desired. It is impossible, obviously, to add the 
original figures, as the units are not the same. This diffi- 
culty may be avoided by using the relative figures, and a 
simple average of the three relatives for a given year may 
serve as the required index. The following is the series of 
index numbers thus secured: 


TABLE 36
Index Numbers of Coal and Petroleum Production in the United
States, 1910-1923
(Production in 1910 = 100)

Year    Index    Year    Index
[The index values of this table are illegible in the source.]


In securing this index, by adding the three relative 
figures for a given year and dividing by three, equal weight 
has been given to each of the three series. Such an index of 
equally weighted relatives has been termed an unweighted 
index, but the term is misleading. Weights are used, the 
weights in this case being equal. It is clear that this index 
based upon equal weights does not reflect faithfully the 
three series combined in the present instance. For the 
three series are not of equal importance, as the system of 
equal weights assumes. The following figures showing the 
wholesale values in exchange in 1921 of bituminous coal, 
anthracite coal and petroleum indicate the relative
importance of the three series:¹


Mineral             Wholesale Value in Exchange in 1921
Bituminous Coal     $1,948,000,000
Anthracite Coal        731,000,000
Petroleum              712,000,000


Roughly, these stand to each other in the relation of 3, 1 
and 1, and these weights may be assigned to the series under 
consideration. An index for each year may be computed, 
using these weights. The following example, showing the 
calculations for the years 1910 and 1923, will illustrate 
the method: 


TABLE 37
Computation of Weighted Index Numbers of Coal and Petroleum
Production

                    Relative                        Relative
Mineral             Production,  Wt.  Wt. × Rel.    Production,  Wt.  Wt. × Rel.
                    1910                            1923
Bituminous Coal     100          3    300           131          3    393
Anthracite Coal     100          1    100           113          1    113
Petroleum           100          1    100           346          1    346

Total                            5    500                        5    852

Index of fuel production, 1910 = 500 ÷ 5 = 100
Index of fuel production, 1923 = 852 ÷ 5 = 170


¹ The figures have been compiled by the U. S. Bureau of Labor Statistics.




The value of the index thus secured for each of the four- 
teen years covered is shown below: 


TABLE 38
Weighted Index Numbers of Coal and Petroleum Production in the
United States, 1910-1923

Year    Index    Year    Index
1910    100      1917    135
1911    101      1918    141
1912    106      1919    124
1913    114      1920    145
1914    107      1921    126
1915    111      1922    124
1916    …        1923    170


Quite important differences between the two series of 
index numbers are apparent. The second series, which is 
the more logically weighted, is, of course, the more accurate 
of the two, and gives a more faithful representation of the 
combined effect of the forces affecting the three series. 
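The difference between the two systems of weights may be seen by computing both averages for a single year. The sketch below uses the 1923 relatives and the weights 3, 1 and 1 given in Table 37; the function name `weighted_index` is illustrative, and the equally weighted figure is computed here from those relatives rather than quoted from the text.

```python
def weighted_index(relatives, weights):
    """Weighted arithmetic average of relatives."""
    total_weight = sum(weights[name] for name in relatives)
    weighted_sum = sum(weights[name] * relatives[name] for name in relatives)
    return weighted_sum / total_weight

# Relatives for 1923 on the 1910 base, as used in Table 37.
relatives_1923 = {"bituminous": 131, "anthracite": 113, "petroleum": 346}

value_weights = {"bituminous": 3, "anthracite": 1, "petroleum": 1}
equal_weights = {name: 1 for name in relatives_1923}

weighted_1923 = weighted_index(relatives_1923, value_weights)  # 852 / 5
equal_1923 = weighted_index(relatives_1923, equal_weights)     # 590 / 3
```

The weighted figure rounds to 170, agreeing with Table 37, while equal weights give a materially higher figure because the large petroleum relative then counts as heavily as bituminous coal.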

Another type of index number is one in which the items 
in the constituent series are totaled, the aggregate figure, 
instead of an average, serving as the representative of the 
entire group. Such a form of index number may be constructed
only when the different series are all expressed in
the same unit. This form is frequently employed as an 
indication of changes in the level of prices, the aggregate 
cost of a bill of goods at one period being compared with 
the aggregate cost of the same goods at other dates. The 
following figures illustrate this type of index. 


TABLE 39
Bradstreet’s Index Numbers of Wholesale Prices in the United
States, 1913-1923

Year    Index     Year    Index
1913    $ 9.21    1919    18.66
1914      8.90    1920    18.81
1915      9.85    1921    …
1916     11.82    1922    12.12
1917     15.64    1923    13.40
1918     …


Each of the yearly aggregates quoted above is the sum 
of the average prices during the year of 96 commodities at 
wholesale. Before being added all the prices are reduced to 
the “per pound” basis, so that a certain degree of compara- 
bility is secured. Such an index may be readily changed to 
the relative form, any year being taken as a base and the 
totals for the other years expressed as percentages of the 
figure for the base year. 
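The change of an aggregate index to the relative form, with any year as base, may be sketched as follows. Only the 1913-1917 aggregates of Table 39 are used; the function name `to_relatives` and the rounding to one decimal place are assumptions made for illustration.

```python
def to_relatives(aggregates, base_year):
    """Express each yearly aggregate as a percentage of the base-year aggregate."""
    base = aggregates[base_year]
    return {year: round(100 * total / base, 1) for year, total in aggregates.items()}

# Bradstreet's yearly aggregates (dollars), 1913-1917, from Table 39.
bradstreet = {1913: 9.21, 1914: 8.90, 1915: 9.85, 1916: 11.82, 1917: 15.64}

on_1913_base = to_relatives(bradstreet, 1913)
on_1915_base = to_relatives(bradstreet, 1915)
```

Shifting the base changes every relative but leaves the ratio between any two years unaltered, since each total is simply divided by a different constant.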

The examples which have been given will indicate some 
of the many forms which index numbers may take. The 
term may refer to a simple relative number; it may be 
applied to an average of relative terms, or to an aggregate 
of relative or absolute figures. In all the examples given 
the index has been designed to serve as a measure of change 
over a period, as an indicator of changes in the values of 
time series. The term may have a much broader meaning 
than this. An index of the ability of salesmen might be 
constructed by giving numerical values to the factors 
determining their usefulness and securing an average of 
these values. An index of the efficiency of different depart- 
ments in a business enterprise might be constructed. In 
any case, the construction of an index involves the reduction 
to comparable terms of a number of different factors and 
the replacement of these several terms by a single figure 
which may serve as their representative. Comparison is 
thus facilitated, whether it be comparison over time or 
comparison with indices secured by averaging terms re- 
lating to a similar unit. In all its forms (except the first 
limited and exceptional meaning in which it applies to a 
simple relative) an index number is thus a type of statistical 
average, and such numbers, in their construction and use, 
are subject to all the rules and limitations set forth in the 
development of the subject of averages. 

In the present work we are interested only in the application
of the index number device to time series. So varied,
however, are the rules and practices relating to its application
to different types of time series that certain of these
types must be treated separately. Our first concern is with
index numbers of wholesale prices.


PRICE CHANGES


When price movements are surveyed in detail it is 
difficult to perceive order, or any definite trend. We find a 
multiplicity of conflicting movements. The following price 
quotations, taken at random, are roughly typical of what 
would be found were the entire field of prices canvassed in 
order to compare price movements from month to month: 


TABLE 40
Commodity Prices at Wholesale¹

                                                              Price            Price
Commodity                                        Unit         (Wholesale)      (Wholesale)
                                                              October, 1923    November, 1923
Brick, common building, average of … prices      1000         $14.752          $14.746
Pig Iron, basic, Valley furnace                  Gross ton     23.500           20.875
Cement, Portland, average of plant prices        Bbl.           1.893            1.842
Linseed Oil, raw, N. Y.                          Gal.            .943             .910
Steel Billets, Bessemer, Pitts.                  Gross ton     40.000           40.000
…                                                100 lbs.       5.500            5.500
Copper, electrol., refinery                      Pound           .126             .128
…                                                Pound           .069             .069
…                                                Pound           .067             .067
Coal, Anthr. Chestnut, N. Y.                     Gross ton     11.471           11.478
Coal, Bit. Mine run, Chi.                        Net ton        4.600            4.525
Crude Petroleum, Penn., at wells                 Bbl.           2.500            2.388
Gasolene, Motor, N. Y.                           Gal.            .185             .170
Cotton, Middling, N. O.                          Pound           .292             .339
Wheat, #2 Red Winter, Chi.                       Bu.            1.097            1.061
Sugar, Granulated, N. Y.                         Pound           .090             .087

[Three commodity names are illegible in the source and are shown as ….]


¹ As compiled by the United States Bureau of Labor Statistics.

Of the sixteen commodities listed, four showed no price
change at all between October and November, 1923, three
showed price increases, and in nine cases prices declined.
Some of the price movements were inconsiderable, while 
some marked very material changes. Such, as seen here 
in miniature, is what happens in the price system as a whole. 
All prices do not, with absolute uniformity, move up or 
down or remain constant. Each of the thousands of com- 
modities traded in on the markets of any country, or of the 
world, moves in its own individual way, subject to a variety 
of influences. Yet it does not act in isolation. In its price 
movements it affects other commodities, and is affected 
by them. And, in addition to the forces peculiar to each 
commodity, there are broad forces which act throughout 
the price system, affecting all commodities. It is the busi- 
ness of the economic statistician to bring order out of the 
chaos of price movements taking place at any given time 
and, out of the multiplicity of minor movements, to pick the 
broad trends which affect the whole economic system. 
The forces which bring about the price movements which 
are to be studied are numerous and complicated, but some 
general conclusions may be drawn with regard to them. 
There are, in the first place, all those changes in production 
and consumption conditions peculiar to individual commodi- 
ties and affecting directly the prices of those commodities. 
The opening of new fields, improvements in production 
technique in individual cases, changes in fashion and the 
transfer of demand from some commodities to others, 
changes in demand and supply with the seasons — all these 
are causing constant price readjustments. These are the 
changes which in ordinary times are most obvious, which 
are brought home directly to the individual merchant or 
consumer. Such changes affect the whole price system, 
as has been pointed out, but not in general by causing 
upward or downward movements in the system as a whole. 
Such general movements are due to forces which are 
broader in their scope. The general improvement in production
technique and the increase in the productivity of
human labor which has resulted have, by increasing the
supply of commodities available for consumption, affected 
prices. Changes in monetary systems and, in particular, 
changes in the gold supply have exerted a direct and imme- 
diate influence upon prices, by affecting the supply of money 
in circulation. Similar in character have been changes in 
banking and credit systems and changes in commercial 
practice which have affected the use of credit instruments 
and the rapidity of circulation of money and credits. All 
these forces affect prices, though their incidence is not so 
specific as that of the factors affecting individual commodi- 
ties directly. 


PURPOSE OF GENERAL INDEX NUMBERS OF
WHOLESALE PRICES


These separate forces cannot be isolated and evaluated. 
Their joint action causes a perplexing variety of price 
changes. In studying these changes the problem might be 
approached from several different points of view. It might 
be desired to study the readjustments which take place 
within the price system, to determine the nature and degree 
of the shifts which changing conditions cause within the 
system. Such a study would yield valuable information 
as to the behavior of prices and the character of their inter- 
relations. Our immediate problem, however, is the deter- 
mination of the net resultant of all these forces. Do all 
price movements cancel each other so that while some 
prices move up and some down there is no net change? Or 
is there at a given time a preponderance of movements in 
one direction, causing the level of general prices to move 
upward or downward? If there is such a trend, what is it, 
and how may it be measured? Are the statistical methods 
which have been explained in the earlier sections applicable 
to the solution of this problem? 

The first step in this study involves the answering of the
last question asked. It has been brought out that methods
of summarizing quantitative data have been developed, 
but that these methods are applicable only when certain 
conditions are fulfilled. An average, it was noted, has no 
significance unless it represents a distinct central tendency 
in a mass of homogeneous data. Moreover, the type of 
average to be employed depends upon the character of the 
distribution it is to represent. Until the distribution of 
the original data is studied no average or other statistical 
measure can be intelligently employed. We must first, 
then, determine what the raw materials of the problem are, 
and study the frequency distribution secured when these 
raw materials are organized. 

For the present a quite general purpose will be assumed, 
the determination of the change in the level of general 
wholesale prices between two specific dates. This is equiva- 
lent, of course, to measuring the change in the value of 
money between two given dates. The raw materials of the 
problem consist of a number of price quotations on indi- 
vidual commodities, quotations being secured for the two 
dates which are to be compared. Each pair of quotations 
measures the change in the price of a single commodity, 
a change caused by the interplay of many forces. When a 
great many such price quotations are brought together we 
have a mass of data representing the interaction of a multi- 
tude of forces, some individual and specific in their incidence, 
some general, affecting the prices of large groups of com- 
modities or of all commodities. What we seek to determine 
is the net resultant of all these factors which are affecting 
prices. We seek a measure of the composite effect of the 
numerous forces which are causing individual prices to rise 
or fall. This measure will constitute an index number of 
wholesale prices. 

The unit with which we must deal is a single price varia- 
tion. Whether the statistical methods with which we are 
familiar may be employed in the organization and analysis
of a number of such units depends upon the behavior
of such units in mass. The following examples illustrate 
the frequency distributions secured when these data are 
classified. 


FREQUENCY DISTRIBUTIONS OF PRICE RATIOS


Each price variation is, of course, a ratio, the ratio of the 
price of a commodity at a given date to the price of the 
commodity at another date. The ratios may be reduced 
to a comparable basis by putting them all in the form of 
relatives, of the type illustrated in the earlier examples of 
index numbers. Thus, using one of the pairs of price quo- 
tations given above, the ratio of the price of pig iron in 
November, 1923, to the price in October, 1923, is $20.875: 
$23.500 which, in the form of a relative, becomes 88.8 : 100. 
In constructing the following frequency table, the prices at 
wholesale in 1914 of 346 commodities were expressed as 
relatives, with the 1913 price as a base in each case. The 
distribution of these 346 relative numbers is as follows:¹


TABLE 41
Distribution of the Relative Prices of 346 Commodities in 1914
(Average prices in 1913 = 100)

Relative Prices    Mid-points    No. of cases    Percentage of total
                   m             f               number of cases
 62.5- 67.4         65             1               .3
 67.5- 72.4         70             1               .3
 72.5- 77.4         75             5              1.5
 77.5- 82.4         80             7              2.0
 82.5- 87.4         85            20              5.6
 87.5- 92.4         90            35             10.0
 92.5- 97.4         95            51             14.5
 97.5-102.4        100           134             39.0
102.5-107.4        105            50             14.5
107.5-112.4        110            21              6.0
112.5-117.4        115            12              3.5
117.5-122.4        120             3              1.0
122.5-127.4        125             2               .6
127.5-132.4        130             2               .6
132.5-137.4        135
137.5-142.4        140             1               .3
142.5-147.4        145
147.5-152.4        150             1               .3

                                 346            100.0

¹ The 346 commodities included were those employed by the U. S. Bureau of
Labor Statistics in the construction of its index of wholesale prices. The original
figures, and the relatives, appear in Bulletin 269, of that Bureau, on “Wholesale
Prices, 1890-1919.”


The frequency polygon representing this distribution
appears in Fig. 47. For purposes of comparison with similar
distributions the figure shows the percentage distribution.
The correspondence of this frequency distribution to the
standard types portrayed in earlier sections is obvious.
There is the same marked concentration about a central
tendency, in this case a tendency of prices to remain stable,
for 39% of all the cases showed a change not exceeding 2.5%
from their prices in the base year. There is also, in this
case, a fairly symmetrical distribution about this central
tendency, though the range above the mode is greater than
the range below, a fact of considerable significance. Without
at present considering the question as to which average
might best be used to represent the central tendency in this
distribution, it is apparent that the use of some average is
quite legitimate.

FIG. 47. Frequency Polygon: Distribution of Relative Prices of 346
Commodities in 1914. (Average prices in 1913 = 100)
[Horizontal axis: Relative Price. Vertical axis: Frequency (Percentage).]
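The concentration described above may be checked from the grouped figures of Table 41. The sketch below reads the two partly illegible frequencies of that table as 21 and 2, a reading which brings the total to the stated 346; the variable names are illustrative, and the mean is the usual grouped approximation that treats every case as falling at the mid-point of its class.

```python
# Mid-points and frequencies of Table 41 (relative prices in 1914, 1913 = 100).
midpoints = [65, 70, 75, 80, 85, 90, 95, 100, 105,
             110, 115, 120, 125, 130, 135, 140, 145, 150]
frequencies = [1, 1, 5, 7, 20, 35, 51, 134, 50,
               21, 12, 3, 2, 2, 0, 1, 0, 1]

total_cases = sum(frequencies)

# Arithmetic mean from grouped data: sum of (mid-point x frequency) / N.
grouped_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / total_cases

# Percentage of all cases falling in the modal class (97.5-102.4).
modal_share = 100 * max(frequencies) / total_cases
```

On these figures the grouped mean is about 99 and the modal class holds a little under 39 per cent of the cases, in keeping with the stability of prices noted in the text.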

The example just given has been based upon price varia- 
tions from one year to the next, over a period during which 
the level of general prices showed no change. W. C. 
Mitchell gives a much more comprehensive illustration, 
based upon the distribution of 5578 price variations from 
one year to the next over the period 1890-1913, which shows 
the same general grouping. The excess of the range above
the mode over the range below is somewhat more pro- 
nounced, in connection with which fact it should be noted 
that prices were rising during most of the 23 years covered. 
The distribution secured by Mitchell is shown in Fig. 41. 

The inertia of prices is most conspicuous when year to 
year price changes are studied. It is therefore advisable to 
consider the character of price variations over a longer 
period, that we may learn whether the same type of dis- 
tribution is secured. Two examples are given, one of price 
changes over a ten-year period, a period so chosen that the 
price level at the end was approximately the same as at the 
beginning; the other of price changes over a five-year period 
characterized by rapidly rising prices. The table following 
shows the distribution of 222 price variations, prices in 
1900 being expressed as relatives on an 1890 base. The 
general price level, it should be noted, fell from 1890 to 
1896 and rose from 1896 to 1900; prices in 1900 were 
approximately one per cent lower than in 1890. 


TABLE 42
Distribution of Relative Prices of 222 Commodities in 1900
(Average prices in 1890 = 100)

Relative Prices    Mid-points    No. of cases    Percentage of total
                   m             f               number of cases
 42.5- 47.4         45             2              1.0
 47.5- 52.4         50             1               .5
 52.5- 57.4         55             3              1.5
 57.5- 62.4         60             3              1.5
 62.5- 67.4         65             3              1.5
 67.5- 72.4         70             3              1.5
 72.5- 77.4         75             8              3.5
 77.5- 82.4         80            12              5.0
 82.5- 87.4         85            19              8.5
 87.5- 92.4         90            20              9.0
 92.5- 97.4         95            30             13.5
 97.5-102.4        100            27             12.0
102.5-107.4        105            24             10.5
107.5-112.4        110            14              6.0
112.5-117.4        115            18              8.0
117.5-122.4        120            10              4.5
122.5-127.4        125             7              3.0
127.5-132.4        130             9              4.0
132.5-137.4        135             2              1.0
137.5-142.4        140             4              2.0
142.5-147.4        145             1               .5
147.5-152.4        150             1               .5
152.5-157.4        155             1               .5

                                 222            100.0


These data are plotted in the form of a frequency polygon
in Fig. 48, the percentage distribution being shown. (In
comparing Figures 47 and 48 the difference in the vertical
scales should be noted.)

FIG. 48. Frequency Polygon: Distribution of Relative Prices of 222
Commodities in 1900. (Average prices in 1890 = 100)
[Horizontal axis: Relative Price. Vertical axis: Frequency (Percentage).]

The distributions depicted in Figs. 47 and 48 differ
materially. The range of the variations is greater in the
second case, a condition naturally to be expected because of
the longer period covered. Secondly, a very much smaller
percentage of cases is concentrated in the modal group,
though there is still a pronounced central tendency. Both
distributions, as plotted on the arithmetic scale, are fairly
symmetrical. (In considering this fact, it should be remembered
that the price level as measured by the arithmetic
mean was practically the same at the two dates compared
in each case.) In the first case the concentration
about the central tendency is much more marked, and the
deviations of individual price ratios from the arithmetic
mean are smaller. This distribution resembles one which
would be secured from highly accurate physical measure- 
ments, or the distribution of shots from a very accurate 
piece of artillery. The second curve corresponds to one 
representing less accurate physical measurements, or to 
the distribution of shots from an old or inaccurate field 
piece. The modal value occurs less frequently and the 
deviations from the mean are greater. It has been es- 
tablished that the longer the period covered in price com- 
parisons such as those made above, the more pronounced 
is the tendency shown in the second curve. The value of 
the maximum ordinate falls and the range of the distribution
increases. The curve becomes flatter and more extended
as the time interval increases. And, quite obviously,
as this process goes on the representative character of any 
type of average declines. Unless there is concentration 
about a central tendency an average is merely an abstrac- 
tion, without concrete significance. 

It is possible at this point to state as a tentative conclu- 
sion that price variations are capable of statistical measure- 
ment, that they may be appropriately represented by an 
average value, provided the period covered is not too long. 
No definite statement can be made as to the maximum 
period over which price variations may be measured. 
Index numbers having accurate and significant values must 
be based upon comparisons over relatively short periods, 
the most accurate being year-to-year comparisons. Index 
numbers designed merely to show general trends in prices 
may cover longer periods, though the makers and users of 
such index numbers should realize their limitations.¹

The two examples given above represented comparisons 
between dates when prices, as measured by the arithmetic 
mean of a number of variations, were at approximately 
the same level. These are, of course, particular cases, and 
it is desirable to study the distribution of price variations 
over a period characterized by changes in the general level 
of prices. The following table shows the distribution of the 
relative prices of 1437 commodities in 1918, average prices 
during the period July, 1913, to June, 1914, serving as base.²

¹ These conclusions are largely based upon studies made by W. C. Mitchell.
Cf. “The Making and Using of Index Numbers,” Bulletin 284 (Wholesale
Price Series), United States Bureau of Labor Statistics.

² Computed by the Price Section of the War Industries Board; reproduced in
Part 1, Bulletin 284, U. S. Bureau of Labor Statistics, 70.


TABLE 43
Distribution of the Relative Prices of 1437 Commodities in 1918
(Average Prices, July, 1913, to June, 1914 = 100)

Relative Prices    Mid-points    No. of cases    Percentage of total
                   m             f               number of cases
  36                 36             1              *
  49                 49             1              *
  50- 69             60             4               .3
  70- 89             80            17              1.2
  90-109            100            61              4.3
 110-129            120            64              4.5
 130-149            140           130              9.0
 150-169            160           212             14.7
 170-189            180           219             15.2
 190-209            200           164             11.4
 210-229            220           135              9.4
 230-249            240           104              7.2
 250-269            260            76              5.3
 270-289            280            54              3.8
 290-309            300            42              3.0
 310-329            320            30              2.1
 330-349            340            31              2.1
 350-369            360            16              1.1
 370-389            380            13               .9
 390-409            400             7               .5
 410-429            420             7               .5
 430-449            440             8               .6
 450-469            460             4               .3
 470-489            480             4               .3
 490-509            500             4               .3
 510-529            520             5               .4
 530-549            540             3               .3
 550-569            560             4               .3
 587                587             1              *
 627                627             1              *
 727                727             1              *
 730                730             1              *
 743                743             1              *
 761                761             1              *
 784                784             1              *
 826                826             1              *
 848                848             1              *
 900                900             1              *
1165               1165             1              *
1356               1356             1              *
1585               1585             1              *
1764               1764             1              *
2049               2049             1              *
2863               2863             1              *
3009               3009             1              *

                                 1437

* Less than one tenth of one per cent.


This distribution is shown graphically in Fig. 49. (The 
scales in this figure are not the same as those employed in 
the two figures preceding). 

FIG. 49. Frequency Polygon: Distribution of Relative Prices of 1437
Commodities in 1918. (Average prices, July, 1913, to June, 1914 = 100)
(After Mitchell)
[Horizontal axis: Relative Price. Vertical axis: Frequency (Percentage).]

A study of this distribution bears out the conclusion
reached from the two examples preceding. There is a
central tendency sufficiently pronounced to be well represented
by an average. In this case, moreover, the modal
group is that with a mid-point of 180, so that the tendency
toward concentration is to be attributed not to inertia, but
to the presence of external forces affecting the price system
as a whole. There is, however, one marked point of difference
between this distribution and the two others. The
tendency toward skewness, which was in evidence in the 
first example, is pronounced in this case. The curve, as 
plotted on the arithmetic scale, is markedly asymmetrical. 
The greatest concentration is near the lower limit of the 
scale and a long tail, extending in fact far beyond the limit 
of the chart, tapers out to the right. The details of this 
condition are most clearly apparent from the table. Starting 
from 100 as a base, the smallest relative price is 36, an 
absolute decline of 64 points on the scale. The highest 
relative price is 3009, an absolute increase of 2909 points. 
It is true that this occurred during a period characterized 
by a pronounced rise in the general price level. (The index 
of the War Industries Board shows that prices in 1918 were 
94% above those in the base period.) That accounts for 
the location of the modal group, but it does not explain the 
nature of the distribution. In explaining the shape of the 
frequency curve we shall be preparing to answer a question 
which is fundamental in index number construction, namely, 
What type of average should be employed in securing a 
figure representative of the central tendency in a distribu- 
tion of price variations? 


THE PROBLEM OF AVERAGING PRICE VARIATIONS


A price increase, expressed as a relative, has no upper 
limit. An increase of 100, 500, 1000 per cent or more is 
conceivable and possible. The greatest price increase noted 
by the War Industries Board in its study of prices during 
the war was one of 4981 per cent, in the case of acetipheneti- 
din. But 100 per cent is the maximum decline possible, as 
that would mean that the price of a commodity had fallen
to zero. This is the explanation of the skewness noted in
the curves shown. When any considerable number of price 
ratios are tabulated the corresponding frequency curve, 
plotted on an arithmetic scale, shows this characteristic 
feature, a feature which is most conspicuous during a 
period of rising prices. 

The bearing of this fact on the selection of an appropriate 
type of average has been brought out in the discussion of 
averages. When ratios are being averaged the geometric 
mean is the appropriate average to employ. This fact is 
clearly demonstrated by comparing the different averages 
for the distribution last shown. The arithmetic mean of 
the 1437 price ratios there tabulated is 217, while the mode 
is 180. When a mean departs any considerable distance 
from the mode it obviously loses something of its repre- 
sentative character. The geometric mean in this case is 
194, a value much closer to both the mode and the median 
(191). Quite apart from other arguments, this relationship 
between the different averages indicates that the geometric 
mean is to be preferred to the arithmetic, in the present 
case. 
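
The point can be illustrated with a minimal sketch. The two price relatives below are hypothetical (they are not taken from the War Industries Board data): one price doubles and another is cut in half, so that the two changes offset each other as ratios.

```python
import math

# Hypothetical price relatives (base = 100): one price doubles, one halves.
relatives = [200.0, 50.0]

arithmetic_mean = sum(relatives) / len(relatives)

# Geometric mean, computed through logarithms.
geometric_mean = math.exp(sum(math.log(r) for r in relatives) / len(relatives))

print(round(arithmetic_mean, 1))  # 125.0
print(round(geometric_mean, 1))   # 100.0
```

The arithmetic mean reports a rise of 25 per cent, while the geometric mean reports no change in the average ratio.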

The fact was noted in the chapter on averages that, when 
a frequency distribution plotted on a logarithmic scale 
approximates the normal curve more closely than when 
plotted on the arithmetic scale, this indicates that the data 
follow the geometric rather than the arithmetic law of 
dispersion. Figure 50 shows the distribution of the 1437 
relative prices in 1918 on the 1913-14 base, when the 
relative prices are plotted on a logarithmic scale. The 
pronounced asymmetry which is apparent when the data 
are plotted on an arithmetic scale is gone, and a curve very 
closely resembling the normal type is secured. 

While this tendency of price ratios to follow a geometric 
law of dispersion is a strong argument for the use of the 
geometric mean in averaging these ratios, this does not mean 
that price index numbers may not be constructed by other
methods. In general, the geometric mean is the logical
average to use, but good index numbers may be constructed 
in other ways. This point will be developed in a subsequent 
section. 

FIG. 50. Frequency Polygon: Distribution of Relative Prices of 1437 Commodities
in 1918, with Relative Prices plotted on Logarithmic Scale. (Average
prices, July, 1913, to June, 1914 = 100)
[Figure: vertical axis, Frequency (Percentage); horizontal axis, Relative Price.]

The argument which has been developed in the preceding
pages may be briefly summarized. Before discussing the
practice of index number construction it was considered
advisable to study the character of the raw materials and 
the nature of the distributions secured when these materials 
are brought together, in order to determine whether 
ordinary statistical methods are appropriate. The raw 
materials, we have seen, consist of individual price varia- 
tions, expressed as ratios. When a number of these ratios 
are assembled a frequency distribution is secured which
somewhat resembles the distribution of data following the
normal law of error. A central tendency, which may 
legitimately be represented by an average, is apparent in 
the distribution of price variations. The central tendency 
is less marked, however, and the deviations from it are more 
pronounced, the longer the period covered in the price 
comparison, so that an average becomes less representative 
as this period increases. In addition, a tendency toward 
skewness has been noted, and this was seen to be quite 
pronounced in a period of rising prices. This skewness is 
due to the fact that we are dealing with ratios which 
have a definite lower limit and no upper limit. Referring 
to the previous discussion of averages, we have found that 
price variations constitute a type of material following the 
geometric law of dispersion, and that if we are seeking the 
average of these variations the geometric mean should be 
employed. While this general conclusion holds true, it has 
been noted that for certain purposes other methods of index 
number construction may appropriately be employed. 

The problem as to the type of average which should be 
used, and the connected problem as to the best method of 
weighting, have occupied the center of the stage in all dis- 
cussions on the theory of index number construction. 
Assuming that the purpose of an index number of wholesale 
prices is to serve as a measure of the average ratio of change 
in general prices, the best theoretical solution of the first 
problem has been suggested. The question of weighting
introduces a new element. 

The chief reason for the employment of a system of 
weights in the construction of index numbers has been 
indicated above. Whether the index number is to measure 
variations in the physical volume of production, in the 
volume of trade, or in prices, all the quotations and ratios 
which must be averaged in securing the index number are 
not of equal importance. In the construction of an index 
of production coal is more important than lead; in the
construction of an index of wholesale prices the price of wheat
is more important than the price of flaxseed. The problem 
of measuring the relative importance of the different com- 
modities to be included in an index number is one which 
has been solved in various ways. In large part the selection 
of weights is dependent upon the particular purpose which 
a given index number is to serve, as well as upon practical 
difficulties involved in securing the data upon which weights 
may be based. Several examples of varying weights are 
shown in the section which follows. A further discussion 
of the subject will follow the consideration of these examples. 


VARIETY OF METHODS EMPLOYED IN INDEX NUMBER
CONSTRUCTION


Many methods have been and are being employed in the 
construction of index numbers of wholesale prices. Usage 
varies for many reasons. There are some differences of 
opinion as to which is theoretically the best method. There 
are practical difficulties to be surmounted, difficulties which 
inevitably cause differences in practice because of the vary- 
ing resources of the agencies engaged in these tasks. And 
there are, finally, certain differences due to the varying 
purposes for which index numbers are constructed. 

Prevailing differences in practice and differences in the 
results secured by the employment of various methods in 
the construction of index numbers can perhaps be illus- 
trated most effectively by the application of a number of 
methods to the same data. The table which follows presents 
the raw material to which these various methods are to be 
applied — the average farm prices, on December 1, of 
twelve leading crops, from 1910 to 1923. 


TABLE 44

Average Farm Prices, on December 1, of Twelve Leading Crops, 1910-1923¹

[Table printed sideways in the original; its entries are not legible in this copy.
It gives, for each year from 1910 to 1923, the price per unit of corn (bu.),
cotton (lb.), hay (short ton), wheat (bu.), oats (bu.), white potatoes (bu.),
sugar² (lb.), barley (bu.), tobacco (lb.), flaxseed (bu.), rye (bu.) and rice (bu.).
The prices for 1910 and 1911 are repeated in Table 46.]

1 For the period 1910-20 these figures were assembled by Warren M. Persons. Cf. “Fisher's Formula for Index Numbers,” Review
of Economic Statistics, Prel. Vol. III, 103. For the period 1921-23 the prices quoted are from Weather, Crops and Markets.
2 The price of sugar given for each year is the wholesale price of the raw product in the month of December. No figure corresponding
to the farm prices could be secured for this commodity.


EXPLANATION OF SYMBOLS 


The symbols to be employed in the computation of 
different types of index numbers have the following mean- 
ings: 

$p_0$ : price of a given commodity at time “0” (the base period).
$q_0$ : quantity of same commodity at time “0”.
$p_1$ : price of same commodity at time “1”.
$q_1$ : quantity of same commodity at time “1”.
$p_0'$ : price of a second commodity at time “0”.
$q_0'$ : quantity of second commodity at time “0”.
$p_1'$ : price of second commodity at time “1”.
$q_1'$ : quantity of second commodity at time “1”.

$\frac{p_1}{p_0}$ : a price relative (relation of price of a given commodity at
time “1” to price of same commodity at time “0”).
$\frac{q_1}{q_0}$ : a quantity relative.

$P_0$ : price level at time “0”.
$P_1$ : price level at time “1”.


SIMPLE INDEX NUMBERS OF PRICES


In his exhaustive analysis of methods of index number
construction¹ Irving Fisher distinguishes six fundamental
types: the aggregative (or price aggregate), the arithmetic,
harmonic, geometric, median and mode. The last has
never been employed in a practical way, and may be
omitted. The characteristics of the five remaining types 
may be brought out by considering each of them in its 
simplest form, before examining the more complicated 
combinations. 


AGGREGATES OF ACTUAL PRICES

In the construction of index numbers of the simple
aggregative type, commodity prices pertaining to a given
date are added; general price changes are measured by
comparing the results thus secured for different dates.
Using the above symbols

$$\frac{P_1}{P_0} = \frac{\sum p_1}{\sum p_0}$$

When such index numbers are constructed from the data of
Table 44 the following results are secured. The actual
aggregates are given in column (2); to facilitate comparison
the same figures are reduced to relatives, with the 1910
aggregate as base, in column (3).

1 The Making of Index Numbers, Houghton Mifflin Co., 1922.


TABLE 45
Index Numbers of Farm Crop Prices
(Aggregates of Actual Prices)

(1)               (2)                        (3)
Year     Index (aggregate of          Index, Relative
            actual prices)              (1910 = 100)

1910          $18.9653                      100
1911           21.5814                      114
1912           17.3785                       92
1913           18.5124                       98
1914           17.4722                       92
1915           17.3540                       91
1916           21.5739                      113
1917           30.5142                      161
1918           33.8198                      178
1919           35.7938                      189
1920           25.8247                      136
1921           18.7790                       99
1922           20.0230                      106
1923           21.9437                      116


The results secured by this method of constructing 
index numbers of prices will be compared shortly with 
results secured from the same data by other methods. The 
chief weakness of this type of index number is obvious. 
This is not an unweighted nor yet an equally weighted 
index. The influence of each commodity upon the result 
is dependent upon the price of the unit in which it happens
to be traded. In the present index, hay, which is quoted
by the ton, is given more weight than all the other 
11 commodities combined, with flaxseed second in im- 
portance. The index secured by adding the quotations is 
weighted in an entirely illogical fashion and cannot be 
accepted as reflecting the course of farm crop prices. 
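
The computation for 1910 and 1911 may be sketched as follows. The prices are those of Table 46, as far as they can be read in this copy; the variable names are illustrative only.

```python
# Prices per quoted unit in 1910 and 1911 (Table 46).
prices_1910 = {
    "corn": 0.480, "cotton": 0.141, "hay": 12.14, "wheat": 0.883,
    "oats": 0.344, "potatoes": 0.557, "sugar": 0.0393, "barley": 0.578,
    "tobacco": 0.093, "flaxseed": 2.317, "rye": 0.715, "rice": 0.678,
}
prices_1911 = {
    "corn": 0.618, "cotton": 0.088, "hay": 14.29, "wheat": 0.874,
    "oats": 0.450, "potatoes": 0.799, "sugar": 0.0494, "barley": 0.869,
    "tobacco": 0.094, "flaxseed": 1.821, "rye": 0.832, "rice": 0.797,
}

# Simple aggregates of actual prices for each year.
aggregate_1910 = sum(prices_1910.values())
aggregate_1911 = sum(prices_1911.values())

# Aggregate for 1911 expressed as a relative on the 1910 base.
index_1911 = 100 * aggregate_1911 / aggregate_1910

# Share of the 1910 aggregate contributed by hay alone.
hay_share_1910 = prices_1910["hay"] / aggregate_1910

print(round(aggregate_1910, 4), round(aggregate_1911, 4))  # 18.9653 21.5814
print(round(index_1911, 1))                                # 113.8
print(round(hay_share_1910, 2))                            # 0.64
```

Hay alone supplies nearly two-thirds of the 1910 aggregate, which is the illogical weighting referred to above.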

One method which has been employed for avoiding the 
unequal weighting caused by the difference in units in 
which different commodities are traded is to reduce all 
quotations to the same unit. Thus hay, rice, corn, cotton 
and the other commodities might all be quoted by the 
pound, and these quotations added to secure the index. 
Yet this method, which has been employed in the con- 
struction of Bradstreet’s index, merely replaces one system 
of illogical weighting by an equally illogical one. Equal 
weight, if such is desired, is not given to all commodities by 
this method. Thus, in 1910 hay was worth $.00607 per 
pound, cotton $.141 per pound and rice $.015 per pound, 
cotton having a weight in an aggregate of per pound prices 
9 times that of rice and 23 times that of hay. 


ARITHMETIC AVERAGES OF RELATIVE PRICES 


Another method employed in the construction of index 
numbers involves the reduction of each quoted price to a 
relative, with reference to the price of the same commodity 
at a certain basic date, these relative figures then being 
averaged by any of the conventional methods. The follow- 
ing example illustrates the first phase of this process, data 
for two years being utilized. The year 1910 is taken as 
base. 


TABLE 46
Computation of Relative Prices for the Construction of Index Numbers

(1)               (2)            (3)           (4)        (5)           (6)
Commodity         Unit           Price, 1910   Relative   Price, 1911   Relative

Corn              Bu.            $  .480       100        $  .618       128.8
Cotton            Lb.               .141       100           .088        62.4
Hay               Ton (short)     12.14        100         14.29        117.7
Wheat             Bu.               .883       100           .874        99.0
Oats              Bu.               .344       100           .450       130.9
White Potatoes    Bu.               .557       100           .799       143.5
Sugar             Lb.               .0393      100           .0494      125.6
Barley            Bu.               .578       100           .869       150.2
Tobacco           Lb.               .093       100           .094       101.1
Flaxseed          Bu.              2.317       100          1.821        78.7
Rye               Bu.               .715       100           .832       116.2
Rice              Bu.               .678       100           .797       117.5

                                               1200                    1371.6


From these figures the arithmetic averages of relative
prices in these two years may be readily computed. The
formula for any single relative is $\frac{p_1}{p_0}$. When there are $N$
relatives the formula for the index number at time “1” is

$$\frac{\sum \left(\frac{p_1}{p_0}\right)}{N}$$

In the present case

$$\text{Index (1910)} = \frac{1200}{12} = 100.$$

$$\text{Index (1911)} = \frac{1371.6}{12} = 114.3.$$

Index numbers computed in this way for the years 1910 to
1923, inclusive, are shown in column (3) of Table 49.
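
The same arithmetic, as a brief sketch using the 1911 relatives exactly as printed in column (6) of Table 46 (the variable names are illustrative):

```python
# Relative prices for 1911 on the 1910 base, as printed in Table 46.
relatives_1911 = [128.8, 62.4, 117.7, 99.0, 130.9, 143.5,
                  125.6, 150.2, 101.1, 78.7, 116.2, 117.5]

total = sum(relatives_1911)
index_1911 = total / len(relatives_1911)

print(round(total, 1))       # 1371.6
print(round(index_1911, 1))  # 114.3
```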


WEIGHTS IMPLICITLY EMPLOYED IN SIMPLE
AVERAGES OF RELATIVE PRICES


This type of index number is usually termed an “unweighted”
index of relative prices. It is weighted, however,
just as are the types illustrated in the two examples preceding.
The quantity employed as weight in each case is
the amount of each commodity which would sell for $100
in the base year. In the preceding example the following
quantities have been employed as weights:

Corn ...............   208.3  bu.
Cotton .............   710.0  lbs.
Hay ................     8.24 tons
Wheat ..............   113.3  bu.
Oats ...............   291.0  bu.
Potatoes ...........   180.0  bu.
Sugar ..............  2650.0  lbs.
Barley .............   173.2  bu.
Tobacco ............  1076.0  lbs.
Flaxseed ...........    43.2  bu.
Rye ................   140.0  bu.
Rice ...............   147.7  bu.

What has been done, in effect, in the computation of the 
simple average of relative prices has been to determine the 
aggregate amount for which the above quantities would sell 
in each of the fourteen years included. At 1910 prices each
of the above quantities would sell for $100, the aggregate 
value being $1200; at 1911 prices the aggregate value of 
the above quantities was $1371.60. These aggregates, 
divided by 12, give the index numbers shown in column 
(3), Table 49: 100 for 1910, 114 (114.3) for 1911, etc. Thus
the “unweighted average of relative prices” is in fact a
weighted aggregate of actual prices. It is equally weighted
in the sense that the value of the quantity of each commodity
employed as weight was equal to $100 in the base
year, 1910.¹
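
The equivalence can be checked with a short sketch. Here the implicit quantities are computed without rounding from the 1910 prices of Table 46, so they differ slightly from the rounded figures listed above; the names are illustrative.

```python
# Prices for 1910 and 1911 in the commodity order of Table 46.
prices_1910 = [0.480, 0.141, 12.14, 0.883, 0.344, 0.557,
               0.0393, 0.578, 0.093, 2.317, 0.715, 0.678]
prices_1911 = [0.618, 0.088, 14.29, 0.874, 0.450, 0.799,
               0.0494, 0.869, 0.094, 1.821, 0.832, 0.797]
n = len(prices_1910)

# Quantity of each commodity that $100 would buy at 1910 prices.
implicit_quantities = [100 / p0 for p0 in prices_1910]

# Value of that fixed bundle at 1911 prices, divided by the number of items.
value_1911 = sum(q * p1 for q, p1 in zip(implicit_quantities, prices_1911))
aggregate_index = value_1911 / n

# Simple arithmetic average of the relative prices.
mean_of_relatives = sum(100 * p1 / p0
                        for p0, p1 in zip(prices_1910, prices_1911)) / n

print(round(aggregate_index, 1), round(mean_of_relatives, 1))  # 114.3 114.3
```

The two figures coincide because each term of the aggregate, (100 / p0) × p1, is itself the relative price of the commodity.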

1 Attention was called to this characteristic of the simple average of relative
prices by F. R. Macaulay, American Economic Review, Dec., 1915, 928.

MEDIANS OF RELATIVE PRICES

The median rather than the arithmetic mean may be
employed in securing the average of the relative prices for
each year. When the relatives in column (6) of Table 46
are arranged in order of magnitude the following distribution
is secured:

 62.4      117.7
 78.7      125.6
 99.0      128.8
101.1      130.9
116.2      143.5
117.5      150.2


The smallest relative price is 62.4, the greatest 150.2; 
the median value is 117.6. This median value is the index 
number for 1911. All the index numbers computed in this 
way from the medians of relative prices are presented in 
column (4), Table 49. 
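
A brief sketch of the same step, using the relatives as printed in Table 46 and Python's standard `statistics.median` (which, for an even number of items, averages the two middle values):

```python
import statistics

# Relative prices for 1911 on the 1910 base, as printed in Table 46.
relatives_1911 = [128.8, 62.4, 117.7, 99.0, 130.9, 143.5,
                  125.6, 150.2, 101.1, 78.7, 116.2, 117.5]

ordered = sorted(relatives_1911)
median_1911 = statistics.median(relatives_1911)

print(ordered[5], ordered[6])  # 117.5 117.7 (the two middle items)
print(round(median_1911, 1))   # 117.6
```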


GEOMETRIC AVERAGES OF RELATIVE PRICES 


The geometric averages of the relative prices for the
various years may now be computed and the results compared
with those secured in the preceding examples. A
single relative being represented by the symbol $\frac{p_1}{p_0}$, the
formula for the geometric mean of $N$ relatives is

$$M_g = \sqrt[N]{\frac{p_1}{p_0} \times \frac{p_1'}{p_0'} \times \frac{p_1''}{p_0''} \times \cdots}$$

A geometric mean is generally computed by the aid of
logarithms; in this case

$$\log M_g = \frac{\log\left(\frac{p_1}{p_0}\right) + \log\left(\frac{p_1'}{p_0'}\right) + \log\left(\frac{p_1''}{p_0''}\right) + \cdots}{N}$$


The method of computation may be illustrated for the 
years 1910 and 1911. The relative prices of the various 
commodities are repeated from Table 46. 


TABLE 47
Computation of Geometric Averages of Relative Prices

(1)               (2)               (3)                 (4)               (5)
Commodity         Relative Price,   Logarithm of        Relative Price,   Logarithm of
                  1910              Fig. in col. (2)    1911              Fig. in col. (4)

Corn              100               2.0                 128.8             2.10992
Cotton            100               2.0                  62.4             1.79518
Hay               100               2.0                 117.7             2.07078
Wheat             100               2.0                  99.0             1.99564
Oats              100               2.0                 130.9             2.11694
White Potatoes    100               2.0                 143.5             2.15685
Sugar             100               2.0                 125.6             2.09899
Barley            100               2.0                 150.2             2.17667
Tobacco           100               2.0                 101.1             2.00475
Flaxseed          100               2.0                  78.7             1.89597
Rye               100               2.0                 116.2             2.06521
Rice              100               2.0                 117.5             2.07004

                                    24.0                                  24.55694

$$\log M_g\ (1910) = \frac{24}{12} = 2$$

$$M_g = \text{antilogarithm of } 2 = 100$$

$$\log M_g\ (1911) = \frac{24.55694}{12} = 2.04641$$

$$M_g = \text{antilogarithm of } 2.04641 = 111.3.$$

This value, 111.3, is the index number for 1911. The
results for all the years are summarized in column (5),
Table 49.
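
The logarithmic computation of Table 47 can be sketched as follows, again with the relatives as printed (the names are illustrative):

```python
import math

# Relative prices for 1911 on the 1910 base, as printed in Table 46.
relatives_1911 = [128.8, 62.4, 117.7, 99.0, 130.9, 143.5,
                  125.6, 150.2, 101.1, 78.7, 116.2, 117.5]

# Average the common logarithms, then take the antilogarithm.
log_sum = sum(math.log10(r) for r in relatives_1911)
log_mean = log_sum / len(relatives_1911)
geometric_mean = 10 ** log_mean

print(round(geometric_mean, 1))  # 111.3
```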


HARMONIC AVERAGES OF RELATIVE PRICES

The characteristics of the harmonic average have been
discussed in a preceding chapter. The reciprocal of the
harmonic mean, it will be recalled, is the arithmetic mean
of the reciprocals of the constituent measures. The constituent
items, in the present case, are price relatives of the
form $\frac{p_1}{p_0}$. The reciprocal of such a relative is $\frac{p_0}{p_1}$. The
formula for the harmonic mean of $N$ price relatives is,
therefore,

$$\frac{1}{H} = \frac{\sum \left(\frac{p_0}{p_1}\right)}{N}$$

or

$$H = \frac{N}{\sum \left(\frac{p_0}{p_1}\right)}$$

The method of computation is illustrated in the following
table:


TABLE 48
Computation of Harmonic Averages of Relative Prices

(1)               (2)          (3)             (4)          (5)
Commodity         Relative     Reciprocal      Relative     Reciprocal
                  price,       of Fig. in      price,       of Fig. in
                  1910         col. (2)        1911         col. (4)

Corn              100          .01             128.8        .007763975
Cotton            100          .01              62.4        .01602564
Hay               100          .01             117.7        .008496177
Wheat             100          .01              99.0        .010101010
Oats              100          .01             130.9        .007639419
White Potatoes    100          .01             143.5        .006968641
Sugar             100          .01             125.6        .007961783
Barley            100          .01             150.2        .006657790
Tobacco           100          .01             101.1        .009891197
Flaxseed          100          .01              78.7        .01270648
Rye               100          .01             116.2        .008605852
Rice              100          .01             117.5        .008510638

                               .12                          .111328602

$$H\ (1910) = \frac{12}{.12} = 100$$

$$H\ (1911) = \frac{12}{.111328602} = 107.8.$$


The index numbers computed in this way for all the years 
included in the study are shown in column (6), Table 49. 
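
A minimal sketch of the computation in Table 48, using the relatives as printed (the names are illustrative):

```python
# Relative prices for 1911 on the 1910 base, as printed in Table 46.
relatives_1911 = [128.8, 62.4, 117.7, 99.0, 130.9, 143.5,
                  125.6, 150.2, 101.1, 78.7, 116.2, 117.5]

# Harmonic mean: number of items divided by the sum of their reciprocals.
reciprocal_sum = sum(1 / r for r in relatives_1911)
harmonic_mean = len(relatives_1911) / reciprocal_sum

print(round(reciprocal_sum, 6))  # 0.111329
print(round(harmonic_mean, 1))   # 107.8
```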




In the construction of the five types of index numbers 
explained above no attempt has been made to use a logical 
weighting system. All are termed “unweighted” averages,
a term which is quite misleading. The first index con- 
structed, based on aggregates of actual prices, is a heavily 
weighted index number, though the weights are illogical. 
In the next four the quantities employed as weights are the 
amounts purchasable for $100 in 1910. The five results 
may now be brought together and compared. In each case 
the index is given to the nearest whole number. 


TABLE 49
Index Numbers of Farm Crop Prices, 1910-1923
(1910 = 100)

(1)      (2)             (3)             (4)           (5)             (6)
Year     Aggregates      Arithmetic      Medians of    Geometric       Harmonic
         of actual       averages of     relative      averages of     averages of
         prices (as      relative        prices        relative        relative
         relatives)      prices                        prices          prices

1910     100             100             100           100             100
1911     114             114             118           111             108
1912      92              95              93            92              90
1913      98             104              98           100              97
1914      92             101             102            97              91
1915      91             104             104           102             101
1916     113             156             152           151             147
1917     161             208             208           204             198
1918     178             215             209           210             205
1919     189             252             226           241             231
1920     136             151             143           145             139
1921      99             115              99           107             101
1922     106             129             114           124             119
1923     116             142             134           135             129


These index numbers are plotted in Fig. 51. 


COMPARISON OF SIMPLE INDEX NUMBERS 


FIG. 51. Comparison of Five Simple Index Numbers of Farm Crop Prices,
1910-1923. (1910 = 100)
[Figure: the five index numbers of Table 49 plotted by year.]

The four averages of relative prices agree much more
closely with each other than with the index numbers based
on aggregates. For reasons already suggested the latter
is quite untrustworthy as a measure of price changes. Of
the other index numbers, the arithmetic, geometric and
harmonic means show a consistent relationship, a fact which
follows from the nature of the averages employed. Except
in the base year the geometric mean is always less than
the arithmetic and the harmonic is always less than the
geometric, the amount of difference increasing as the dispersion
of prices becomes greater. The median, with only
twelve items to be averaged, is somewhat unstable, and
its relationship to the other averages is not always a consistent
one.
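
The ordering just described is an instance of the standard inequality among the harmonic, geometric and arithmetic means of positive numbers. Writing $r_1, \ldots, r_N$ for the price relatives of a given year (a notation assumed here for convenience, not one used in the text), it may be stated as:

$$\frac{N}{\sum_{i} \frac{1}{r_i}} \;\le\; \sqrt[N]{\prod_{i} r_i} \;\le\; \frac{\sum_{i} r_i}{N}$$

Equality holds only when all the relatives are equal, as in the base year, where every relative is 100.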

How are we to choose among these varying results? No 
one of these “unweighted” index numbers is perfect, for
weights which have crept in do not measure the relative 
importance of the various commodities included in the 
index numbers. But, neglecting for the moment the 
question of weights, is it possible to test the adequacy of 
the different methods for measuring changes in the prices 
as given? 


THE TIME REVERSAL TEST


For this purpose Irving Fisher has employed what he 
terms the “time reversal test.” This is merely a test to 
determine whether a given method will work both ways in 
time, forward and backward. If from 1910 to 1911 sugar 
should increase from four to eight cents a pound, the price 
in 1911 would be 200 per cent of the price in 1910, and the 
price in 1910 would be 50 per cent of the price in 1911. One 
figure is the reciprocal of the other; their product (2.00 x 
.50) is unity. Similarly, if a given method of index number 
construction shows the general price level in one year to be 
200 per cent of the level in the preceding year, it should 
work correctly when reversed; it should show that the 
price level in the first year was 50 per cent of the price level 
in the second year. When the data for any two years are 
treated by the same method, but with the bases reversed, 
the two index numbers secured should be reciprocals of each
other. Their product should always be unity. If it is not,
there is an inherent bias in the method. 

This test may be applied to the methods employed above,
using prices for 1910 and 1911. With 1910 as base the
following results were obtained:

Year     Aggregates      Arithmetic      Medians of    Geometric       Harmonic
         of actual       averages of     relative      averages of     averages of
         prices (as      relative        prices        relative        relative
         relatives)      prices                        prices          prices

1910     100             100             100           100             100
1911     113.79414       114.3           117.6308      111.3           107.8

and with 1911 as base:

Year     Aggregates      Arithmetic      Medians of    Geometric       Harmonic
         of actual       averages of     relative      averages of     averages of
         prices (as      relative        prices        relative        relative
         relatives)      prices                        prices          prices

1910      87.87799        92.76           85.0117       89.85           87.47
1911     100             100             100           100             100

When the index numbers for 1911 in the first table are
multiplied by the corresponding index numbers for 1910
in the second table, we have the following values. (In
securing these products the index numbers are put in the
ratio, not in the percentage form.)

         Aggregates      Arithmetic      Medians of    Geometric       Harmonic
         of actual       averages of     relative      averages of     averages of
         prices          relative        prices        relative        relative
                         prices                        prices          prices

         1.00            1.0602          1.00          1.00            .9429

This time reversal test is met by three of the methods 
employed. It is not met by either the arithmetic or harmonic
averages. The former has a distinct upward bias,
amounting to more than six per cent when the errors for 
1910 and 1911 are compounded, while the harmonic mean 
shows almost as large an error in the opposite direction. 
Unless the inherent bias which is found in both these 
averages is rectified in some way, methods based upon these 
averages should not be used in the construction of index 
numbers. 
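
The test described above can be sketched in a few lines. The function names are hypothetical, and the relatives are computed directly from the 1910 and 1911 prices of Table 46 rather than taken from the rounded printed figures, so the products may differ from the table in the last decimal place.

```python
import math
import statistics

# Prices for 1910 and 1911 in the commodity order of Table 46.
p_1910 = [0.480, 0.141, 12.14, 0.883, 0.344, 0.557,
          0.0393, 0.578, 0.093, 2.317, 0.715, 0.678]
p_1911 = [0.618, 0.088, 14.29, 0.874, 0.450, 0.799,
          0.0494, 0.869, 0.094, 1.821, 0.832, 0.797]

def ratios(p_base, p_given):
    """Price relatives in ratio (not percentage) form."""
    return [g / b for b, g in zip(p_base, p_given)]

def aggregate(p_base, p_given):
    return sum(p_given) / sum(p_base)

def arithmetic(p_base, p_given):
    r = ratios(p_base, p_given)
    return sum(r) / len(r)

def median(p_base, p_given):
    return statistics.median(ratios(p_base, p_given))

def geometric(p_base, p_given):
    r = ratios(p_base, p_given)
    return math.exp(sum(math.log(x) for x in r) / len(r))

def harmonic(p_base, p_given):
    r = ratios(p_base, p_given)
    return len(r) / sum(1 / x for x in r)

def time_reversal_product(index, p_a, p_b):
    """Forward index (a as base) times backward index (b as base)."""
    return index(p_a, p_b) * index(p_b, p_a)

products = {
    fn.__name__: time_reversal_product(fn, p_1910, p_1911)
    for fn in (aggregate, arithmetic, median, geometric, harmonic)
}
for name, value in products.items():
    print(name, round(value, 4))
```

A product of unity means the method passes the test. With these data the arithmetic product comes out near 1.06 and the harmonic near 0.94, while the other three are unity to the accuracy shown.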


THE WEIGHTING OF INDEX NUMBERS


Five simple index numbers of prices have been described 
in the preceding section. With the introduction of weighting 
the number of possible combinations is greatly increased, 
but only a few of these types need concern us here. 

In the construction of an accurate measure of price 
changes logical weights must be employed, weights which 
truly reflect the relative importance of the commodities 
included. If the weighting problem is ignored haphazard 
and illogical weights will inevitably be present, whether 
recognized or not. 

The data used in the preceding examples may be utilized 
to illustrate methods of weighting and to show the effects 
of varying weights upon the values of index numbers. 
The weights employed in constructing index numbers of 
farm crop prices may be either the quantities or values of 
the crops produced, depending upon the type of index 
selected. The quantities produced during the period 1910-1923
are given in the following table:¹

1 For the period 1910-19 the figures are taken from “An Index of the Physical
Volume of Production,” E. E. Day, 1921, 8. (Reprinted from the Review of
Economic Statistics, Sept., 1920.) For the years 1920-23 the data are from Weather,
Crops and Markets.


TABLE 50
Annual Physical Production, Twelve Crops, 1910-1923

[Table printed sideways in the original; its entries are not legible in this copy.
It gives, for each year from 1910 to 1923, the production, in millions of units,
of corn, cotton,¹ hay, wheat, oats, potatoes, sugar,² barley, tobacco, flaxseed,
rye and rice.]

1 Bales of 500 pounds gross weight.
2 The figures for sugar production “represent total production of beet and cane sugar in contiguous United States for the
crop year indicated, plus imports from non-contiguous United States for the fiscal year beginning July 1 of the crop year.”
For the years 1921-23 computations were based upon estimated values for this series.


WEIGHTED AGGREGATES OF ACTUAL PRICES 


The thoroughly illogical results obtained when actual 
prices, as quoted, are totaled to secure an index number 
have been pointed out. The same objection cannot be 
made when the prices are appropriately weighted before 
the aggregate is taken. If for weights we employ the 
quantities produced in the base year (at time “0”) the
formula for the weighted aggregate is

$$\frac{\sum p_1 q_0}{\sum p_0 q_0}$$


This is, in effect, the method employed by the United States 
Bureau of Labor Statistics, though the quantities are taken 
from a year other than the base year. The method is 
illustrated in Table 51.

The desired index numbers, in the form of relatives, may 
be computed from the aggregates secured by totaling 
columns (5) and (8). Either year may be taken as the base, 
and the price aggregate in the other year expressed as a 
relative on this base. With the 1910 aggregate as base the 
index for 1911 is 111.5. Index numbers similarly computed 
for the other years are given in column (2), Table 54. 
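
A minimal sketch of a weighted aggregate of actual prices follows. The prices for three commodities are those of Table 46; the quantities are hypothetical round figures chosen for illustration, not the production figures of Table 50, so the result is not the 111.5 obtained from the full set of twelve crops.

```python
# Prices per bushel in 1910 and 1911 (Table 46) for three commodities.
p0 = {"corn": 0.480, "wheat": 0.883, "oats": 0.344}
p1 = {"corn": 0.618, "wheat": 0.874, "oats": 0.450}

# HYPOTHETICAL base-year quantities (millions of bushels), for illustration only.
q0 = {"corn": 3000.0, "wheat": 600.0, "oats": 1200.0}

def weighted_aggregate_index(base_prices, given_prices, quantities):
    """Sum of given-year prices times weights over sum of base-year prices times weights."""
    numerator = sum(given_prices[c] * quantities[c] for c in quantities)
    denominator = sum(base_prices[c] * quantities[c] for c in quantities)
    return 100 * numerator / denominator

index_1911 = weighted_aggregate_index(p0, p1, q0)
print(round(index_1911, 1))  # 122.5
```

Because each price is multiplied by a physical quantity, the unit in which a commodity happens to be quoted no longer affects its influence on the result.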

Another type of weighted aggregate may be constructed, 
with weights taken not from the base period but from the 
later period in the given comparison. That is, we may 
employ $q_1$ (quantity at time “1”) as weight in comparing
prices at time “1” with prices at time “0,” and employ $q_2$
(quantity at time “2”) as weight in comparing prices at
time “2” with prices at time “0”. Algebraically, the
formula for the index number at time “1” is

$$\frac{\sum p_1 q_1}{\sum p_0 q_1}$$

The process of computation is precisely the same as in the
preceding example, except that the weights are changed
with each successive year. The index numbers secured by
this method are given in column (3), Table 54.

TABLE 51
Computation of Weighted Aggregates of Actual Prices

[Table printed sideways in the original; its entries are not legible in this copy.
For each of the twelve commodities it gives the weight (quantity produced in 1910,
in millions) and the products of that weight with the 1910 price ($p_0 q_0$) and
with the 1911 price ($p_1 q_0$), together with the totals of the two product columns.]

The weights in these two cases have been quantities, for
prices, multiplied by quantities, give aggregates in terms
of prices. But in weighting individual price relatives
quantities will not serve. The abstract relatives must be 
weighted by values, if the resulting products are to be com- 
parable. For values are in terms of a common dollar unit, 
while quantities may be expressed in a variety of units. 
The values which are to be employed as weights may be 
derived in various ways. 

Fisher¹ outlines the four following methods, of which the
second and third are hybrid types:


I. Each weight = base year price × base year quantity ($p_0 q_0$).
II. Each weight = base year price × given year quantity ($p_0 q_1$).
III. Each weight = given year price × base year quantity ($p_1 q_0$).
IV. Each weight = given year price × given year quantity ($p_1 q_1$).


Just as certain averages possess inherent bias, so a distinc- 
tive weight bias arises from each type of value weighting. 
(This inherent bias is absent from the quantity weighting.) 
A downward bias arises from weighting systems I and II 
(in which base year prices are used), while an upward bias 
arises from weighting systems III and IV (using prices in 
the given year). This is in part capable of mathematical
demonstration² and has in part been established by numer-
ous trials.


1 The Making of Index Numbers, 54. 
2 An index weighted by type III must exceed an index weighted by type I.
Weighting the price relative of a given commodity by type III, we have

$$\frac{p_1}{p_0} \times p_1 q_0$$

while by type I we have

$$\frac{p_1}{p_0} \times p_0 q_0$$

If p₁ exceeds p₀ (if the price relative is above 100) the weight by type III (p₁q₀)
is greater than the weight by type I (p₀q₀). That is, all relatives above 100 are
more heavily weighted by type III than by type I. But if p₁ is less than p₀ the
weight by type III (p₁q₀) is less than the weight by type I (p₀q₀). All relatives
below 100 are less heavily weighted by type III than by type I. Thus the effect
of all price increases is over-emphasized and the effect of all price declines is under-
emphasized by type III, giving a net result always greater than type I. The
same is true of type IV as compared with type II. As between types I and IV




In the several examples next following we shall deal only 
with values of quantities produced in the base year, 1910. 
These values are given in column (3) of Table 52. For 
weighting purposes they are taken to the nearest million. 


WEIGHTED ARITHMETIC AVERAGES OF
RELATIVE PRICES
In the computation of an index of this type, each relative
is multiplied by the appropriate weight and the sum of the 
products is divided by the sum of the weights. The process 
is illustrated in the following table. 


TABLE 52
Computation of Weighted Arithmetic Averages of Relative Prices

     (1)              (2)         (3)         (4)           (5)         (6)         (7)
                    Relative                Relative      Relative                Relative
  Commodity        Price, 1910   Weight   Price × Wgt.   Price, 1911   Weight   Price × Wgt.

Corn                  100        $1,385    $138,500        128.8       $1,385   $178,388.0
Cotton                100           819      81,900         62.4          819     51,105.6
Hay                   100           842      84,200        117.7          842     99,103.4
Wheat                 100           561      56,100         99.0          561     55,539.0
Oats                  100           408      40,800        130.9          408     53,407.2
White Potatoes        100           194      19,400        143.5          194     27,839.0
Sugar                 100           142      14,200        125.6          142     17,835.2
Barley                100           100      10,000        150.2          100     15,020.0
Tobacco               100           103      10,300        101.1          103     10,413.3
Flaxseed              100            29       2,900         78.7           29      2,282.3
Rye                   100            25       2,500        116.2           25      2,905.0
Rice                  100            17       1,700        117.5           17      1,997.5

                                 $4,625    $462,500                    $4,625   $515,835.5


(The weights employed are the values of the quantities produced in 1910, in millions.) 


$$\text{Weighted arithmetic mean (1910)} = \frac{\$462{,}500}{\$4{,}625} = 100$$

$$\text{Weighted arithmetic mean (1911)} = \frac{\$515{,}835.5}{\$4{,}625} = 111.5$$


there is no necessary relation, but in general an index weighted by type IV will 
exceed an index weighted by type I. Base year weighting involves a downward 
bias while given year weighting involves an upward bias. (For a more detailed 
discussion of bias in weighting see Fisher, The Making of Index Numbers, Chapter 
V and pages 384-387.) 
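The computation of Table 52, and the weight bias described in the footnote, may be checked with a short sketch. The relatives and base year weights are those of the table; the names `RELATIVES`, `WEIGHTS` and `weighted_arithmetic_mean` are illustrative only. The type III weights are an approximation: they are obtained by multiplying each rounded type I weight by its price relative, since p₁q₀ = (p₁/p₀) × p₀q₀.

```python
# Relative prices for 1911 (1910 = 100), as given in Table 52.
RELATIVES = {
    "Corn": 128.8, "Cotton": 62.4, "Hay": 117.7, "Wheat": 99.0,
    "Oats": 130.9, "White Potatoes": 143.5, "Sugar": 125.6,
    "Barley": 150.2, "Tobacco": 101.1, "Flaxseed": 78.7,
    "Rye": 116.2, "Rice": 117.5,
}

# Type I weights: values produced in 1910, in millions of dollars (Table 52).
WEIGHTS = {
    "Corn": 1385, "Cotton": 819, "Hay": 842, "Wheat": 561,
    "Oats": 408, "White Potatoes": 194, "Sugar": 142,
    "Barley": 100, "Tobacco": 103, "Flaxseed": 29,
    "Rye": 25, "Rice": 17,
}


def weighted_arithmetic_mean(relatives, weights):
    """Sum of (relative x weight) divided by the sum of the weights."""
    total_weight = sum(weights[name] for name in relatives)
    weighted_sum = sum(relatives[name] * weights[name] for name in relatives)
    return weighted_sum / total_weight


# Type I: base year price x base year quantity.
type_i_index = weighted_arithmetic_mean(RELATIVES, WEIGHTS)

# Type III (approximation from rounded weights): p1*q0 = (p1/p0) * p0*q0.
type_iii_weights = {name: WEIGHTS[name] * RELATIVES[name] / 100
                    for name in WEIGHTS}
type_iii_index = weighted_arithmetic_mean(RELATIVES, type_iii_weights)

print(f"Type I weighting:   {type_i_index:.1f}")
print(f"Type III weighting: {type_iii_index:.1f}")
```

The type I figure reproduces the 111.5 of Table 52, and the type III figure lies above it, as the footnote argues it must.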




This value for 1911, it will be noted, is identical with that 
secured from the computations illustrated in Table 51. That 
index is a weighted aggregate of actual prices, the weights 
being the quantities produced in the base year. An arith- 
metic mean of relative prices, weighted by values in the base 
year, is always equal to a relative constructed from such an 
aggregate.¹


WEIGHTED GEOMETRIC AVERAGES OF 
RELATIVE PRICES 


The process of computing the weighted geometric mean 
is identical with that of computing the unweighted geometric 
mean, except that the logarithm of each relative is multi- 
plied by the given weight and the sum of these weighted 
logarithms is divided by the sum of the weights, the result 
being the logarithm of the desired index.² The method is
illustrated in the following table: 


1 This may be readily demonstrated algebraically. The value of any commodity
in the base year is p₀q₀, while the price relative for a second year is p₁/p₀.
The weighted mean of such price relatives is equal to

$$\frac{p_0' q_0' \dfrac{p_1'}{p_0'} + p_0'' q_0'' \dfrac{p_1''}{p_0''} + p_0''' q_0''' \dfrac{p_1'''}{p_0'''} + \cdots}{p_0' q_0' + p_0'' q_0'' + p_0''' q_0''' + \cdots}$$

which reduces to

$$\frac{\sum p_1 q_0}{\sum p_0 q_0}$$

a weighted aggregate of the type mentioned.

In the same way the harmonic mean, weighted by full values in the second
year, reduces to

$$\frac{\sum p_1 q_1}{\sum p_0 q_1}$$

This has already been encountered as an aggregate of actual prices weighted by
quantities in the second year.
2 The formula for the weighted geometric mean is given on page 136 above.




TABLE 53
Computation of Weighted Geometric Average of Relative Prices, 1911
(1910 = 100)

                    Relative      Logarithm of     Weight (value        Logarithm of
  Commodity        Price, 1911   Relative Price   produced in 1910)   Relative Price × weight

Corn                 128.8          2.10992            1,385              2922.23920
Cotton                62.4          1.79518              819              1470.25242
Hay                  117.7          2.07078              842              1743.59676
Wheat                 99.0          1.99564              561              1119.55404
Oats                 130.9          2.11694              408               863.71152
White Potatoes       143.5          2.15685              194               418.42890
Sugar                125.6          2.09899              142               298.05658
Barley               150.2          2.17667              100               217.66700
Tobacco              101.1          2.00475              103               206.48925
Flaxseed              78.7          1.89597               29                54.98313
Rye                  116.2          2.06521               25                51.63025
Rice                 117.5          2.07004               17                35.19068

                                                       4,625              9401.79973

$$\log M_g = \frac{9401.79973}{4625} = 2.032822$$

$$M_g = 107.9$$


The index for 1911 on the 1910 base is 107.9. Values 
secured for all the years of the period covered are given in 
column (5), Table 54, together with the other weighted 
index numbers already explained. 
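The logarithmic process of Table 53 may be sketched as follows. The relatives and weights are those of the table; the function name `weighted_geometric_mean` is illustrative only. Since the sketch uses full-precision logarithms rather than the five-place values of the table, its result may differ from the printed figure in the second decimal place.

```python
import math

# (commodity, relative price 1911 with 1910 = 100, weight = value produced
# in 1910 in millions of dollars), as given in Table 53.
ROWS = [
    ("Corn", 128.8, 1385), ("Cotton", 62.4, 819), ("Hay", 117.7, 842),
    ("Wheat", 99.0, 561), ("Oats", 130.9, 408),
    ("White Potatoes", 143.5, 194), ("Sugar", 125.6, 142),
    ("Barley", 150.2, 100), ("Tobacco", 101.1, 103),
    ("Flaxseed", 78.7, 29), ("Rye", 116.2, 25), ("Rice", 117.5, 17),
]


def weighted_geometric_mean(rows):
    """Antilogarithm of the weighted mean of the logarithms of the relatives."""
    total_weight = sum(weight for _, _, weight in rows)
    weighted_logs = sum(weight * math.log10(relative)
                        for _, relative, weight in rows)
    return 10 ** (weighted_logs / total_weight)


geometric_index = weighted_geometric_mean(ROWS)
print(f"Weighted geometric mean, 1911 (1910 = 100): {geometric_index:.2f}")
```

The result should agree closely with the 107.9 obtained in Table 53.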

How are we to judge of the relative merits of these three 
index numbers? We may, first, apply the time reversal 
test which was employed in comparing the five simple index
numbers. This test is not met by any of the weighted types 
we have constructed. The geometric is equally at fault 
with the others. Though the simple geometric meets the 
test, the introduction of weighting imparts a bias to the 
result. Judged by that test alone none of the three is 
satisfactory. We may next try the second fundamental 
test which Fisher has developed, which is termed the 
“factor reversal test.”




THE FACTOR REVERSAL TEST


The total value of any commodity in any year is, of course,
the product of the quantity produced and the price per
unit; algebraically, it is equal to p₀′q₀′. The ratio of the
total value in one year to the total value in the preceding
year is

$$\frac{p_1 q_1}{p_0 q_0}$$

If, from one year to the next, both price
and quantity should double, the price relative would be 200,
the quantity relative 200 and the value relative 400. The
total value in the second year would be four times the value
in the first year. The value relative would be equal to the
product of the price and quantity relatives, a relationship
which is obvious in the case of a single commodity.

If, for a number of commodities, we construct an index
of the price change from one year to the next and an index
of the quantity change from one year to the next, we should
expect their product to be equal to the ratio of the total
values in the second year to the total values in the first
year. If the product is not equal to the value ratio, there
is an error in one or both of the index numbers.

As an illustration, we may apply this test to the first
aggregative index constructed,

$$\frac{\sum p_1 q_0}{\sum p_0 q_0}$$

An index of quantities may be computed from this same formula, merely
interchanging the q’s and the p’s; the formula becomes

$$\frac{\sum q_1 p_0}{\sum q_0 p_0}$$

The same price factor appears in numerator and denom-
inator, as we desire to measure only the effect of the quan-
tity change. Substituting the given values of the twelve
farm crops we have

$$\text{Quantity index, 1911 (1910 = 100)} = \frac{\$4{,}446{,}264{,}630}{\$4{,}625{,}494{,}820} = .96125$$


In percentage form the index of quantities produced in 




1911 is 96.125, with 1910 as base. The corresponding 
price index, by the same formula, is 111.5. The product 


.96125 × 1.115 = 1.0718.


That is, if prices have increased 11.5 per cent, while quan- 
tities have decreased 3.875 per cent, the total value should 
show an increase of 7.18 per cent. 
For the value ratio we have

$$\frac{\sum p_1 q_1}{\sum p_0 q_0} = \frac{\$4{,}748{,}718{,}320}{\$4{,}625{,}494{,}820} = 1.02664$$

There is a discrepancy here of 4½ per cent. This formula
does not meet the factor reversal test, and cannot be
accepted as satisfactory.

When this test is applied to the second aggregative index 
we secure the following values for 1911, with respect to 
1910 as base: 


$$\text{Price index} = \frac{\sum p_1 q_1}{\sum p_0 q_1} = 106.8$$

$$\text{Quantity index} = \frac{\sum q_1 p_1}{\sum q_0 p_1} = 92.05$$

Product = .9205 × 1.068 = .9831.


(In securing the product the index numbers are put in
the ratio, not in the percentage form.)


Here is an error in the other direction of over four per cent.
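The two applications of the factor reversal test just made may be repeated in a few lines. This is a sketch only. Three of the four aggregates are quoted in the text; the fourth, the sum of 1911 prices multiplied by 1910 quantities, is not quoted in this passage and is here assumed to be the Table 55 total of (q₀ + q₁)p₁ less the sum of p₁q₁. The names used are illustrative.

```python
# Aggregate values, in dollars, for the twelve farm crops.
SUM_P0_Q0 = 4_625_494_820  # 1910 prices x 1910 quantities (quoted in the text)
SUM_P0_Q1 = 4_446_264_630  # 1910 prices x 1911 quantities (quoted in the text)
SUM_P1_Q1 = 4_748_718_320  # 1911 prices x 1911 quantities (quoted in the text)
# Assumption: derived, not quoted. Table 55 total of (q0 + q1) * p1,
# less the aggregate of 1911 prices x 1911 quantities.
SUM_P1_Q0 = 9_907_352_710 - SUM_P1_Q1


def factor_reversal_gap(price_index, quantity_index, value_ratio):
    """Return (price index x quantity index) minus the value ratio."""
    return price_index * quantity_index - value_ratio


value_ratio = SUM_P1_Q1 / SUM_P0_Q0

# First aggregative index: base year weights in both factors.
price_base = SUM_P1_Q0 / SUM_P0_Q0
quantity_base = SUM_P0_Q1 / SUM_P0_Q0
gap_base = factor_reversal_gap(price_base, quantity_base, value_ratio)

# Second aggregative index: given year weights in both factors.
price_given = SUM_P1_Q1 / SUM_P0_Q1
quantity_given = SUM_P1_Q1 / SUM_P1_Q0
gap_given = factor_reversal_gap(price_given, quantity_given, value_ratio)

print(f"base year weights:  product - value ratio = {gap_base:+.4f}")
print(f"given year weights: product - value ratio = {gap_given:+.4f}")
```

A positive gap for the first formula and a negative gap for the second correspond to the errors in opposite directions noted above.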


The weighted geometric average also fails to meet this 
fundamental factor reversal test. With respect to both the 
geometric index and the aggregates we have, apparently, 
by the introduction of weights spoiled index numbers which 
in their simple form were unbiased. Yet weights we must 
have, if the index numbers are to represent the facts 
accurately. Neither a simple index nor a weighted form 
of a simple index will meet the two tests laid down as 




fundamental. Professor Fisher tested 46 such formulas, 
of which only four (the simple geometric, median, mode 
and aggregative) met the time reversal test, and none met 
the factor reversal test. 


THE “IDEAL” INDEX


A way out of this difficulty is offered by the possibility
of “rectifying” formulas by a crossing process, by averaging
geometrically formulas which err in opposite directions.
Professor Fisher has made exhaustive trials of all possible
formulas by this process, finding thirteen formulas in all
which met both tests. Of these he has selected one as
“ideal,” from the viewpoint of both accuracy and simplicity
of calculation. This ideal index is the geometric mean of the
two aggregative types illustrated above. Its formula¹ is

$$\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$$

This index may be computed readily, in the present
instance, from the results already obtained. Thus for 1911
we have

$$\text{Ideal index} = \sqrt{111.5 \times 106.8} = 109.14$$
This index number meets both the time reversal and the 
factor reversal test. Applying the former: 


Index of prices, 1911 (1910 = 100) = 109.14 
Index of prices, 1910 (1911 = 100) = 91.63 
1.0914 × .9163 = 1.00


For the factor reversal test, applied to the data for 1911, 
(with 1910 as base) we have 


$$\text{Index of prices} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} = 109.14$$


1 The same formula was developed independently by Bowley, Pigou, Walsh 
and Young. See The Making of Index Numbers, xv, 240-242. 




$$\text{Index of quantities} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} = 94.07$$

$$\text{Index of values} = \frac{\sum p_1 q_1}{\sum p_0 q_0} = 102.664$$

Product of price and quantity indices = 1.0914 × .9407 = 1.02668
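That the agreement is a property of the formula, and not an accident of these figures, may be seen by writing out the product. In the worked step below, P and Q are shorthand, introduced here for convenience only, for the “ideal” indices of prices and of quantities.

$$
\begin{aligned}
P \times Q &= \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \cdot \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \cdot \frac{\sum q_1 p_1}{\sum q_0 p_1}} \\
&= \sqrt{\frac{\sum p_1 q_0 \cdot \sum p_1 q_1 \cdot \sum p_0 q_1 \cdot \sum p_1 q_1}{\sum p_0 q_0 \cdot \sum p_0 q_1 \cdot \sum p_0 q_0 \cdot \sum p_1 q_0}} \\
&= \sqrt{\frac{\left(\sum p_1 q_1\right)^2}{\left(\sum p_0 q_0\right)^2}} = \frac{\sum p_1 q_1}{\sum p_0 q_0}
\end{aligned}
$$

The aggregates of p₁q₀ and p₀q₁ cancel, leaving the value ratio. The slight difference between 1.02668 and 1.02664 in the figures given is presumably due to rounding the two indices before multiplying.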


The ideal index, the two weighted aggregates which enter 
into its construction and the geometric mean weighted by 
values in the base year are given in the following table, for 
the years 1910 to 1923. The index numbers are plotted in 
Fig. 52. 


TABLE 54
Comparison of Weighted Index Numbers of Farm Crop Prices,
1910-1923¹

Columns: (1) Year; (2) Aggregative (weighted by base year quantities); (3) Aggregative (weighted by given year quantities); (4) Ideal index (geometric mean of indices in cols. (2) and (3)); (5) Weighted geometric average (weighted by base year values).

[Tabulated values not legible in this copy.]


The wide discrepancies which were found between the 
various simple index numbers do not appear when the 
1 Values for the two sets of aggregative index numbers for the years 1910 to
1919, inclusive, were computed by W. M. Persons. (“Fisher’s Formula for Index
Numbers,” Review of Economic Statistics, Prel. Vol. III, 107.)




weighted indices are compared. There are significant 
differences, but there is none of the erratic behavior of some 
of the simpler forms. 

Of these four types the “ideal” index undoubtedly serves 
as the best measure of the average price change between 
1910 and each of the given years. It is designed, it should 


Fig. 52. Comparison of Four Weighted Index Numbers of Farm Crop Prices,
1910-1923. (1910 = 100)


be remembered, to measure the change between two stated 
times, and not for intermediate comparison. The value of 
the index for 1923, for instance, is determined by the 
relation between prices and quantities in 1910 and in 1923. 
There is double weighting and the weights vary from year 
to year. If 1923 is to be compared with 1922 a new index 
is needed, in which the prices and quantities for 1923 and 
1922 alone are included. Direct comparison on the basis 
of the values for the “ideal” index given in the above




table is liable to error, because of the weighting system 
employed. 

It is one of the merits of the geometric mean with constant 
weights that it permits the index for each year to be com- 
pared directly not only with the base year index, but with 
the index for any other year. The base may be shifted 
directly from the relatives, and the same result will be 
secured as if the computation were made from the original 
data. If this same system be followed with the “ideal”
index no large errors may be expected, but strict accuracy
will not be secured.¹
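The property claimed for the geometric mean with constant weights may be illustrated by a small sketch. The three commodities, their prices in three successive years and their weights are hypothetical, chosen only for the illustration; the function name `geometric_index` is likewise an assumption. The index for the third year on the second year as base is computed directly from the prices, and again by shifting the base of the two index numbers already computed on the first year.

```python
import math

# Hypothetical prices of three commodities in years 0, 1 and 2.
PRICES = {
    "A": [1.00, 1.20, 1.50],
    "B": [4.00, 3.60, 4.40],
    "C": [0.50, 0.55, 0.70],
}
# Hypothetical weights, held constant for every comparison.
WEIGHTS = {"A": 5, "B": 3, "C": 2}


def geometric_index(prices, weights, base, given):
    """Weighted geometric mean of price relatives, in percentage form."""
    total_weight = sum(weights.values())
    weighted_logs = sum(
        weights[name] * math.log(series[given] / series[base])
        for name, series in prices.items()
    )
    return 100 * math.exp(weighted_logs / total_weight)


# Year 2 on year 1 as base, computed from the original prices.
direct = geometric_index(PRICES, WEIGHTS, base=1, given=2)

# The same comparison obtained by shifting the base of the year 0 indices.
on_year_0 = [geometric_index(PRICES, WEIGHTS, base=0, given=t) for t in (1, 2)]
shifted = 100 * on_year_0[1] / on_year_0[0]

print(f"direct:  {direct:.4f}")
print(f"shifted: {shifted:.4f}")
```

The two figures coincide because the weighted sum of logarithms of p₂/p₀ less that of p₁/p₀ is the weighted sum of logarithms of p₂/p₁ when the weights are unchanged. With weights that vary from year to year, as in the “ideal” index, this equality no longer holds exactly.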

The chief obstacles in the way of general adoption of the 
ideal index arise from the difficulty of obtaining annual or 
monthly quantities to use as weights, and from the time 
involved in its computation. Where accuracy is essential 
the latter is not a serious difficulty. As a substitute formula 
which is much more quickly calculated Fisher has proposed 


$$\frac{\sum (q_0 + q_1)\, p_1}{\sum (q_0 + q_1)\, p_0}$$


This formula, which has also been recommended by
Edgeworth and Marshall, is considered by Fisher to be
“the best practical all-around formula, taking all four
points into account — accuracy, speed, minimum legitimate 
circular discrepancy, simplicity.” Results from this formula
will generally differ from those secured from the “ideal”
formula by less than one fourth of one per cent. The 
following table illustrates the method of computation, data 
for 1910 and 1911 being employed. 

1 If year to year comparison be a primary aim in a given instance, the “ideal”
index may be constructed on the chain system. Link index numbers are first
constructed, each year serving as base for the computation of the index for the
succeeding year. These links may then be “chained” with reference to a fixed
base. Warren M. Persons has shown that the errors involved in following this
method are cumulative, and may be serious if the links are chained for a number
of years.




TABLE 55
Computation of Aggregative Index, Weighted by Combined Quantities

   (1)       (2)     (3)          (4)                 (5)              (6)             (7)
                    Price    Quantity 1910 +    Price 1910 × sum      Price     Price 1911 × sum
Commodity   Unit    1910     quantity 1911       of quantities        1911       of quantities
                             (in millions)     col. (3) × col. (4)             col. (6) × col. (4)

Corn        bu.    $ .480       5,417           $2,600,160,000       $ .618      $3,347,706,000
Cotton      lb.      .141      13,650            1,924,650,000         .088       1,201,200,000
Hay         ton    12.14          124.30         1,509,002,000       14.29        1,776,247,000
Wheat       bu.      .883       1,256.4          1,109,401,200         .874       1,098,093,600
Oats        bu.      .344       2,108              725,152,000         .450         948,600,000
Potatoes    bu.      .557         641.7            357,426,900         .799         512,718,300
Sugar       lb.      .0393      7,914              311,020,200         .0494        390,951,600
Barley      bu.      .578         334              193,052,000         .869         290,246,000
Tobacco     lb.      .093       2,008              186,744,000         .094         188,752,000
Flaxseed    bu.     2.317          32.09            74,352,530        1.821          58,435,890
Rye         bu.      .715          68.02            48,634,300         .832          56,592,640
Rice        bu.      .678          47.44            32,164,320         .797          37,809,680

                                                $9,071,759,450                   $9,907,352,710


$$\frac{\sum (q_0 + q_1)\, p_1}{\sum (q_0 + q_1)\, p_0} = \frac{\$9{,}907{,}352{,}710}{\$9{,}071{,}759{,}450} = 109.2 \text{ (index for 1911 on 1910 base).}$$


(The index is here expressed in percentage form.) 
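The computation of Table 55 may be reproduced as follows. The prices and combined quantities are those of the table (the 1910 price of potatoes is taken as $.557, the figure consistent with the product shown in column 5); the function name `combined_quantity_index` is illustrative only.

```python
# (commodity, price 1910, price 1911, quantity 1910 + quantity 1911 in
# millions of units), as given in Table 55.
ROWS = [
    ("Corn", 0.480, 0.618, 5417),
    ("Cotton", 0.141, 0.088, 13650),
    ("Hay", 12.14, 14.29, 124.30),
    ("Wheat", 0.883, 0.874, 1256.4),
    ("Oats", 0.344, 0.450, 2108),
    ("Potatoes", 0.557, 0.799, 641.7),
    ("Sugar", 0.0393, 0.0494, 7914),
    ("Barley", 0.578, 0.869, 334),
    ("Tobacco", 0.093, 0.094, 2008),
    ("Flaxseed", 2.317, 1.821, 32.09),
    ("Rye", 0.715, 0.832, 68.02),
    ("Rice", 0.678, 0.797, 47.44),
]


def combined_quantity_index(rows):
    """Aggregate of given year prices over aggregate of base year prices,
    both weighted by the sum of the quantities of the two years."""
    given_aggregate = sum(p1 * quantity for _, _, p1, quantity in rows)
    base_aggregate = sum(p0 * quantity for _, p0, _, quantity in rows)
    return 100 * given_aggregate / base_aggregate


index_1911 = combined_quantity_index(ROWS)
print(f"Index for 1911 on 1910 base: {index_1911:.1f}")
```

The aggregates are in millions of dollars, since the quantities are in millions; the ratio is unaffected by the unit.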

This formula requires the same data as the “ideal” 
index, and these are not generally to be had. Usually it is 
only possible to secure comprehensive quantity figures 
at each census period, and for the intervening years con- 
stant weights must be employed. In such cases the 


weighted aggregative

$$\frac{\sum p_1 q_0}{\sum p_0 q_0}$$


is probably the most generally useful type. The weighted 
geometric has many virtues, but is subject to a definite 
weighting bias. If no weights can be secured, or even 
approximated, the simple geometric and the simple 
median are far better than any of the other simple types. 
The geometric mean is more generally useful than the 
median. 




THE RELIABILITY OF DIFFERENT INDEX NUMBERS


An index number of prices is always based upon the study 
of a sample, the result being taken as representative of the 
entire field of prices from which the particular sample was 
drawn. Some method is needed, therefore, by which we may 
judge of the reliability of the different types of index num- 
bers, of their probable stability when computed from a 
number of successive samples. Some differences might be 
expected between index numbers based upon different 
samples. With which type of index number would these 
differences due to fluctuations of sampling be least?¹

Truman L. Kelley² has attempted to measure the prob-
able errors of the chief types of index numbers and has
graded these types on the basis of excellence in this respect.
Two index numbers, the weighted geometric mean and the
weighted median, are given the highest grade, as being the
most reliable, the least affected by fluctuations of sampling.
Fisher’s “ideal” index is ranked somewhat lower, though
above the weighted arithmetic and harmonic averages of 
price relatives. The simple unweighted arithmetic average 
of relatives is given the lowest rating in the list. 

For reliability, flexibility and general excellence Kelley 
selects the weighted geometric mean as the best type of 
price index number. A ratio of aggregates 


$$\frac{\sum (p_1 w)}{\sum (p_0 w)}$$


with selected weights (not necessarily precisely equal to the 
quantities marketed or consumed) is given a total score, 
based on the essential requirements of a good index number, 
as high as that of the weighted geometric mean and higher 
than that of the “ideal” index. Weights other than actual 


1 The subject of sampling, in relation to the reliability of statistical measures, 
is discussed in greater detail below. 
2 Statistical Method, 334-346. 




quantities are used in order that there may be flexibility 
in the matter of weighting. 


OTHER PROBLEMS INVOLVED IN THE CONSTRUCTION
OF PRICE INDEX NUMBERS


The preceding section has dealt with the technical 
problems connected with the averaging of a given set of 
data in order to secure an index number of price variations. 
Certain methods have been shown to be quite faulty, while 
certain others have been found to be appropriate for given 
purposes. One who would use index numbers with intelli- 
gence should understand fully the methods which have 
been employed in securing given results, in order that he 
may know precisely what the given figure is designed to 
measure and what degree of reliability attaches to it. 

Such problems as these are not the only ones which 
confront those who construct index numbers, nor are these 
considerations the only ones which users of index numbers 
should bear in mind. Of equal importance with problems 
of averaging and weighting are the practical questions 
connected with the selection of representative samples. 
The only completely accurate measure of the general level 
of commodity prices would be secured by determining the 
ratio between all money units (including credit) in cir- 
culation and all the physical units of goods exchanged for 
money over a given period. The measurement of general 
price changes between two periods would thus involve 
complete knowledge of these two factors for each of the 
two periods. Such knowledge, of course, can not be had, so 
recourse must be had to the method of sampling. And 
primary importance attaches to the number of commodities 
and the character of the commodities upon the prices of 
which a given index number is based. 




NUMBER OF COMMODITIES TO BE INCLUDED


Here again we are confronted with a relation which has 
already been mentioned, the relation between methods and 
uses. Decision as to the number of commodities and the 
kinds of commodities to be included in a given case must 
rest upon the purpose for which the index is to be con- 
structed. Assuming that the index number is to serve as a 
measure of general changes in the price level, the ques- 
tion as to the number of commodities to be included may 
be easily answered —the larger the sample the more 
representative will be the results. The frequency polygon 
based upon a large sample will approach more closely to the 
ideal curve which would represent all price quotations 
than will that based upon a small sample. Thus, as a 
measure of general price changes, more confidence may be 
placed in the Bureau of Labor Statistics index, which is 
based upon 404 price quotations, than in Bradstreet’s, 
which is based upon 96 quotations, though the latter has 
particular virtues of its own. Yet index numbers based 
upon a small number of quotations may not be ruled out as 
without value. Wesley C. Mitchell, whose researches have 
materially increased our knowledge of the price system and 
of the characteristics of index numbers, has compared in 
detail index numbers based upon varying numbers of quo- 
tations. Unexpected similarities are found. Those con- 
structed from a limited number of quotations reflect the 
broad movements of prices in much the same way as do 
those based upon the prices of several hundred commodities. 
In important details there are differences, however, differ- 
ences which may involve doubt as to the movement of 
prices in a given year. In such cases the index numbers 
based upon many quotations must be accepted as the 
more representative of general price movements, provided 
that the commodities included be equally representative 
of the various elements in the price system. 




For other purposes, however, index numbers based upon 
a limited number of quotations may be preferable. This 
is particularly true when a “sensitive” index is desired, one
which will serve as a forecaster of general price movements 
rather than as a precise measure of changes in the general 
price level. Of this type is the index constructed by the 
Federal Reserve Bank of New York, formerly based upon 
quotations of 12 basic commodities (raw materials), and 
now upon quotations of 20 commodities. Somewhat 
similar in character is the index of wholesale prices con- 
structed by Warren M. Persons, which is based upon only 
10 price series. The object in this case has been to secure 
a price index of business cycles, a measure of the swings of 
business through phases of prosperity and depression. 
When a sensitive index is required the object is attained by 
the selection of a limited number of commodities the prices 
of which are subject to extreme fluctuations, rather than by 
the inclusion of a great many commodities. Yet the uses 
to which an index of this type may be put are limited. 
The “sluggishness” of the many-commodities index number
is a sluggishness which inheres in the price system, and 
which must be reflected in a faithful index of general prices. 

The question of the number of commodities to be included 
can not be discussed apart from that of the character of 
the commodities upon the quotations of which an index 
number is based. The representative character of an index 
number rests in part upon the number of price series in-
cluded, but the nature of these series is of even greater
importance. In the selection of price quotations the fact 
must be recognized that there are characteristic differences 
in the behavior of the prices of different commodity groups. 
These groups of prices, their inter-relations, their behavior, 
their relation to the functioning of the economic system 
and to the swings of prosperity and depression, are matters 
of immediate and practical importance to economists and 
business men. 




ELEMENTS OF THE PRICE SYSTEM 


It is impossible to give, by graphic means, a true concep- 
tion of the complicated inter-relations which prevail within 
the price system. In Fig. 53 certain of the broader relations 
only are brought out and some of the more important 


Fig. 53. A Graphic Representation of the Relation between Certain Elements
of the Price System. (Labels legible in the diagram: Prices of the Factors of Production; Land Values; Prices of Commodities; Producers’ Goods; Consumers’ Goods; Wholesale; Manufacturers’ Prices; Jobbers’ & Wholesalers’.)


elements are indicated. The broad subdivisions are prices 
of the factors of production, prices of commodities at 
wholesale, prices of commodities at retail, and prices of 
services, including those rendered by private, profit-seeking 
enterprises and those rendered by governmental units. The 
prices of private services come in part within the category 
of wholesale prices, and in part, in so far as they represent 
final services to ultimate consumers, within that of retail 




prices. Within each of these subdivisions there are impor- 
tant groups of related prices. Within the first subdivision, 
that of the prices of the factors of production, it is found that 
rents and land values, salaries and wages, discount rates 
and interest rates, bond prices and stock prices, all have 
their characteristic modes of behavior during the upward 
and downward swings of prices. In the field of commodity 
prices the same differences are found and, again, the 
various services rendered by private and public bodies are 
characterized by price changes peculiar to themselves. 

All the lines of relationship existing between the different 
units and groups in the price system can not be indicated 
graphically. Were this possible, lines would be seen to 
radiate from every unit and every group to every other unit 
and group. Certain of these relationships are direct and 
significant, while others are tenuous, impossible to trace in 
practice. 

Nor is the representation of the price system by a plane 
surface a just one. This picture shows it at a moment of 
time. Stretching backward are countless ties, connecting 
every price with preceding prices, and reaching forward are 
more ties, for future prices and price relationships derive 
directly from present prices. Thus, for example, it is
obvious that the prices of manufactured goods today have 
been affected by the price of raw materials, labor, etc., in 
the past. 

An accurate index of the purchasing power of money or 
an accurate measure of changes in the general price level 
should include representatives of all these groups and 
subgroups within the price system. No index currently 
published does this. Certain index numbers of wholesale 
prices purport to serve as measures of general price changes, 
but they can not be accepted as such. 

We are at present concerned with the construction of 
index numbers of wholesale prices. What has been said 
should serve to emphasize the true scope and significance of 




such index numbers and to indicate that they measure 
changes in but one element, though perhaps the most impor- 
tant, in the price system. It remains to consider some of 
the different groups of related prices within this division 
of the price system with particular reference to the problem 
of index number construction. 


PRICE GROUPS IN THE FIELD OF WHOLESALE
PRICES


Since an index number of wholesale prices must rest upon 
sample quotations, the sample must be representative, must 
include commodities whose prices are typical of the various 
elements in the price system. The division into elements 
for this purpose must be based upon the character of the 
price changes peculiar to the different groups. Of the 
groups thus distinguished, the most obvious are those 
representing different industries. Textile prices and steel 
prices, leather prices and the prices of chemicals are subject 
to different influences. Trade depressions and revivals do 
not affect all industries at the same time or in the same way, 
so that an index of wholesale prices must include quotations 
from all important industrial groups. If preponderant 
influence upon an index is exerted by the prices of certain 
types of commodities, the index, by that much, loses its 
representative character. Thus Bradstreet’s index, it has 
been established, gives greater weight to cotton fabrics, 
hides and leather, and cured meats than is justified by their 
actual importance in trade, a fact which does not detract 
from its utility for some purposes but which lessens its 
value as a representative index of wholesale prices. 

The extent of these differences between the price move- 
ments of commodities in different industrial groups may be 
appreciated by comparison of the index numbers of whole- 
sale prices of farm products and house furnishing goods 
since 1913. 




In order that an index may be representative it is not alone 
sufficient that all industries be given an appropriate number 
of representatives in the sample. Raw materials and 
manufactured goods show characteristic differences in their 
fluctuations, and fitting representation must be given to 
each of these groups. Prices of the former are, in general, 
more sensitive to changes in business conditions, their 
movements preceding those of manufactured goods and 
showing more violent fluctuations. There are two related 
reasons for these facts. Raw materials are traded in for 
purposes of manufacture and sale. When business improves 
after a period of depression, increased demand on the part 
of consumers (or expected increase in demand) leads com- 
peting manufacturers to bid against each other for raw 
materials. It is in the raw material markets that the 
pressure of increased demand first centers, and this bidding 
causes prices to rise in these markets before the prices of 
other goods are affected. (This is not an invariable rule, as 
was demonstrated during the revival of 1919.) Similarly, 
a period of crisis and liquidation begins with large stocks 
of manufactured goods on hand, and at the first evidence 
of slackening trade demand for raw materials falls off. 
Business forces pure and simple play in the raw material 
markets with more freedom than in the markets for manu- 
factured goods. Hence the tendency of prices in these 
markets to anticipate, in their movements, prices in other 
commodity markets. 

The second reason for the greater stability of prices of 
manufactured goods is found in the fact that these prices 
include a greater percentage of stable cost factors, namely, 
overhead charges and labor costs. Wages, interest, rents 
move more slowly and less violently than do commodity 
prices. The inclusion of these elements in commodity 
prices tends to render these prices more stable. Therefore, 
as commodities move forward from the raw stage to their 
final manufactured condition their prices include more and
more of these stabilizing elements, and become less violent 
in their fluctuations.¹ Important differences between
different classes of manufactured goods arise from this 
fact. 
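The dampening effect of the stable cost elements may be illustrated by a small calculation. The following Python sketch is hypothetical throughout: the function name, the cost shares and the 50 per cent rise in raw material prices are assumptions chosen for illustration, not figures from the text, and it assumes that price moves in proportion to total cost.

```python
def price_change(raw_share, raw_change, stable_change=0.0):
    """Proportional change in a commodity's price when price follows cost.

    raw_share     -- fraction of cost made up of raw materials (assumed)
    raw_change    -- proportional change in raw material prices
    stable_change -- proportional change in wages, interest, rents, overhead
    """
    return raw_share * raw_change + (1.0 - raw_share) * stable_change


# Hypothetical goods at two stages of manufacture; raw materials rise 50 per cent
# while the stable cost elements do not move at all.
semi_finished = price_change(raw_share=0.8, raw_change=0.50)
finished = price_change(raw_share=0.3, raw_change=0.50)

print(f"semi-finished good: {semi_finished:+.0%}")
print(f"finished good:      {finished:+.0%}")
```

Under these assumed shares the good nearer the final stage shows the smaller price movement, which is the relation described in the text.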

Each of the groups last mentioned contains minor groups 
of commodities with distinct price characteristics. Within 
the raw material group there are marked differences between 
agricultural products, animal products, forest products 
and mineral products with respect to price movements. 
Agricultural products are affected by weather and crop 
conditions as well as by business conditions, and, though 
subject to price fluctuations of some magnitude, reflect 
prevailing business conditions less accurately than do the 
prices of mineral products.² Animal and forest products
appear to stand between these two with respect to the
faithfulness with which they reflect business conditions in
their price movements. Thus, in selecting raw materials
for inclusion in a sample of price quotations from which a
representative index number is to be constructed, fair
weight must be given to these various classes.³

Manufactured goods, again, do not constitute a single 
homogeneous group with respect to their price movements. 
In so far as they are to be used for further production, or to 
undergo further manufacture, they resemble raw materials 
in relation to the bidding of competing manufacturers, and 
their prices, therefore, are characterized by relatively wide 
oscillations. In so far as the demand for them is for the purpose
of final consumption, purely business forces have less
weight and their prices are more stable. Related to this 


1 Cf. Mitchell, “The Making and Using of Index Numbers” (Bulletin 284, 
U.S. Bureau of Labor Statistics), 44-45, for examples. 

2 It should not be inferred from this that there is no relation between agri- 
cultural production and the prices of agricultural products, and general business 
conditions. Good reasons have been advanced for believing that there is a direct 
causal relation between agricultural and business conditions. The immediate 
price relation, however, is frequently one of contradictory movements. 

3 Cf. “The Making and Using of Index Numbers,” 47. 


argument is that which has already been presented, the 
increasing stability of prices as the stable elements of wages 
and overhead charges bulk larger in commodity costs. So, 
again, the sample price quotations from which an index of 
wholesale prices is to be constructed must include prices 
representative of producers’ and consumers’ goods, of goods 
in the intermediate as well as the final stages of manufacture.

Other divisions of the price system exist, but those 
indicated above are the most important from our present 
point of view. A representative index number of wholesale 
prices should be based upon price quotations drawn from 
all the groups indicated, with weight given to each group 
in proportion to the relative importance in trade of the 
commodities in that group. 


AMERICAN INDEX NUMBERS OF WHOLESALE PRICES²


INDEX NUMBERS OF THE UNITED STATES
BUREAU OF LABOR STATISTICS


The authoritative index of wholesale prices in the United 
States is that compiled by the United States Bureau of 
Labor Statistics. This index was first constructed in 1902, 
and was continued until 1913 as an unweighted average of 
relative prices, the base of each relative being the average 
price of the given commodity during the ten year period 
1890-1899. The report for the year 1914 marked a material 
change in method, with a corresponding change in the base 
period employed in preparing the index for publication. 
As it at present stands the index for any given period 

1 Ibid., 45, 49.

2 Detailed descriptions and comparisons of these index numbers will be found
in Bulletin 284, U.S. Bureau of Labor Statistics, Index Numbers of Wholesale
Prices in the United States and Foreign Countries. The current bulletin of the
Bureau of Labor Statistics on “Wholesale Prices” gives detailed information as
to the character of the wholesale price index of that department.


(month or year) is a weighted aggregate of actual prices, 
the aggregate being expressed, to facilitate comparison, as 
a relative with 1913 as the base. 

The index is now based upon 404 price series. (A single 
commodity may be represented by several quotations, the 
prices for different grades or in different markets being 
given. Thus for raw cotton there are two quotations, 
Middling, New Orleans, and Middling Upland, New York.) 
The price of each commodity is multiplied by a constant 
weight, the quantity marketed in 1919. 

The following illustrates the method as applied to cotton: 


                                     Average                               Average Price, 1919
Commodity                            Price, 1919    Quantity marketed      × Quantity marketed
                                     (per lb.)      in 1919                in 1919

Cotton, Middling, New Orleans        $.319          3,806,921,000 lbs.     $1,214,407,799
Cotton, Middling Upland, New York     .325          1,903,461,000 lbs.        618,624,825


When this process is carried out for the entire 404 price 
series included, the sum of the values in the last column 
gives the index number for the given period, in this case the 
year 1919. As published, this sum is expressed as a relative, 
the sum in 1913 representing 100. The method of con- 
struction renders it possible to shift the base to any desired 
year or month, changing the given relatives to percentages
on the new base. 
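The arithmetic of the weighted aggregate may be sketched in Python. The two cotton series are taken from the illustration above; the function names and the relatives used in the base-shifting example are assumptions introduced only for illustration, and the sketch covers two of the 404 series, not the full index.

```python
from decimal import Decimal

# Price series from the cotton illustration:
# name -> (average price per lb. in 1919, quantity marketed in 1919 in lbs.).
# The quantity marketed in 1919 is the constant weight.
COTTON_1919 = {
    "Cotton, Middling, New Orleans": (Decimal("0.319"), 3_806_921_000),
    "Cotton, Middling Upland, New York": (Decimal("0.325"), 1_903_461_000),
}


def money_aggregate(series):
    """Sum of price times constant weight over all price series."""
    return sum(price * quantity for price, quantity in series.values())


def shift_base(relatives, new_base):
    """Re-express relatives as percentages of the value in period `new_base`."""
    base_value = relatives[new_base]
    return {period: value / base_value * 100 for period, value in relatives.items()}


# Contribution of the two cotton series to the aggregate for 1919.
cotton_total = money_aggregate(COTTON_1919)
print(cotton_total)

# Hypothetical relatives on the 1913 base, shifted to a 1919 base.
print(shift_base({"1913": 100.0, "1919": 200.0}, "1919"))
```

Because the weights are constant, shifting the base requires only a division of the published relatives; no price series need be reworked.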

This index number, therefore, is based upon the cost at 
wholesale of a bill of goods. The bill of goods remaining 
the same, the total cost changes as the prices of the various 
commodities change, and the index measures the effect of 
these changing individual prices upon the total cost. Sub- 
totals are found for nine commodity groups, and index 
numbers for these groups are published, as well as for the
complete list of commodities. The commodity groups are 
as follows:¹
Farm products. 
Foods. 
Cloths and clothing. 
Fuel and lighting. 
Metals and metal products. 
Building materials. 
Chemicals and drugs. 
House furnishing goods. 
Miscellaneous. 


INDEX NUMBERS OF THE FEDERAL
RESERVE BOARD


The Federal Reserve Board publishes two different sets 
of index numbers of American wholesale prices. The first 
of these is based upon the data compiled by the Bureau of 
Labor Statistics, regrouped in order that certain elements of 
the price system may be more effectively studied. The 
weights and methods employed are those of the Bureau of 
Labor Statistics. The following are the groups for which 
index numbers are compiled by this Board: 


Raw materials. 
Crops. 
Animal products. 
Forest products. 
Mineral products. 
Producers’ goods. 
Consumers’ goods. 


The utility of this classification was first demonstrated by 
W. C. Mitchell. Its significance has been explained in the 
preceding pages. The group index numbers were first 
published in the Federal Reserve Bulletin for October, 1918, 


1 For lists of the commodities included in the several groups, see the latest 
bulletin on “Wholesale Prices,” U.S. Bureau of Labor Statistics.


which also contained a list of the articles in each commodity 
group. 

For the purpose of facilitating international comparison 
of price movements the Federal Reserve Board has begun 
the construction of a series of index numbers of wholesale 
prices in the chief commercial countries. This work was 
undertaken because of the wide variations in method and 
in the character of the commodities included in the construc- 
tion of price index numbers in different countries, variations 
which made comparison difficult. At present these inter- 
national price index numbers are being currently published 
for the United States, Canada, England, France and 
Japan. 

The general method of securing these index numbers is 
that which has been described above in connection with the 
index of the Bureau of Labor Statistics. Money aggregates 
are secured by multiplying the price of each commodity by 
an appropriate weight, and getting the total of the indivi- 
dual products. These money aggregates are expressed as 
relatives, the aggregate for the year 1913 representing 100. 
The indices differ from that of the Bureau of Labor Statistics 
in the number of commodities included, in the grouping 
system employed and in the weights adopted. 

Each of the international price indices is based upon 90 
to 100 price quotations, representing about 70 different 
commodities. A double system of classification is employed. 
One system of grouping follows that already explained, a 
division into raw materials, producers’ goods and consumers’
goods. The same commodities are again classified
into “goods produced” and “goods imported.” Finally, an
index of “goods exported” is constructed. The commodities,
the prices of which are used in this index, are drawn
from the groups previously mentioned, those of importance 
in the export trade being selected. These six group index 
numbers, together with the index of all commodities, are 
published for the various countries. An additional index in
which prices are corrected to the gold basis is also published 
for each of the foreign countries for which index numbers are 
constructed. 

The weights employed in the construction of the group 
index numbers are the quantities produced, imported or exported
in the year 1913. In securing the “all commodities”
index the money aggregate for “goods produced” is added to
that for “goods imported.”
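The combination of the two money aggregates may be sketched as follows. The commodities, prices and quantities are hypothetical, as are the function names; only the procedure (1913 quantities as weights, the aggregate for goods produced added to that for goods imported, the 1913 aggregate representing 100) is taken from the description above.

```python
def money_aggregate(prices, quantities_1913):
    """Sum of price times 1913 quantity over the commodities in one group."""
    return sum(prices[name] * qty for name, qty in quantities_1913.items())


def all_commodities_index(prices_now, prices_1913, produced_qty, imported_qty):
    """'All commodities' relative: produced plus imported aggregates, 1913 = 100."""
    now = money_aggregate(prices_now, produced_qty) + money_aggregate(prices_now, imported_qty)
    base = money_aggregate(prices_1913, produced_qty) + money_aggregate(prices_1913, imported_qty)
    return 100.0 * now / base


# Hypothetical data: one commodity produced at home, one imported.
PRODUCED_QTY = {"wheat": 10.0}
IMPORTED_QTY = {"rubber": 5.0}
PRICES_1913 = {"wheat": 1.0, "rubber": 2.0}
PRICES_NOW = {"wheat": 2.0, "rubber": 3.0}

index_now = all_commodities_index(PRICES_NOW, PRICES_1913, PRODUCED_QTY, IMPORTED_QTY)
print(index_now)
```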

These index numbers have been constructed for the year 
1913, and by months from January, 1919, to date (for 
France, since January, 1920).¹


INDEX NUMBERS OF THE WAR INDUSTRIES BOARD


The most comprehensive index of wholesale prices ever 
compiled in the United States was that constructed by the 
Price Section of the War Industries Board in connection 
with its studies of prices during the war. The period 
1913-1918, inclusive, was covered by these studies. The 
results are shown in great detail in the History of Prices 
During the War, published by the War Industries Board, 
and are summarized in Price Bulletin No. 1 of that series. 

Price quotations were secured by the staff of this Board for 
1474 commodities during the stated period. In the con- 
struction of index numbers 1366 of these price series were 
utilized. These commodities were grouped in fifty com- 
modity classes and seven industrial groups. Index numbers 
were worked out for the fifty classes, the seven main groups 
and for “all commodities,” as well as for many minor divisions
of the classes.

These index numbers are all weighted aggregates of actual 
prices, expressed as relatives, a type similar to that of the 
United States Bureau of Labor Statistics. The weights 
consist of the quantities produced in the United States in 


1 For a detailed description of these index numbers see the pamphlet, 
Prices in the United States and Abroad, Federal Reserve Board, 1924. 


the calendar year 1917, plus the quantities imported in that 
year. Actual prices for a given month, for the commodities 
within a given class, multiplied by the proper weights, and 
totaled, give the money aggregate for that month for 
the particular class being studied. In combining the class 
aggregates to form index numbers for the seven main groups 
and for all commodities, each class was weighted in accord- 
ance with its relative importance, precisely the same method 
being employed as in weighting the various commodities. 

In converting these aggregates to relatives the base 
selected was the twelve month period, July, 1913, to June, 
1914, the twelve months preceding the outbreak of war. 
The base may be shifted at will, however, without loss of 
accuracy. 

In addition to the index numbers published by these 
governmental bodies there are several index numbers of 
wholesale prices compiled and published by private agen- 
cies. Of these, Bradstreet’s and Dun’s index numbers are 
the best known. 


BRADSTREET’S INDEX OF WHOLESALE PRICES 


Bradstreet’s index is published as a sum of actual prices. 
It is constructed by reducing to a “per pound” basis the
prices of 96 staple articles of commerce, and securing the
total of these “per pound” prices. No system of weighting
is employed. The monthly index is constructed from prices 
on the first day of each month, not average monthly prices,
as in the case of the Bureau of Labor Statistics index. Since
the prices are not expressed as relatives, there is no base 
period. The aggregate prices published may be expressed 
as relatives on any month or year as a base. The index 
covers the period 1892 to date. 

It has been pointed out in an earlier section that all index 
numbers are weighted, whether the process be a conscious 
or an unconscious one. Bradstreet’s index in practice gives
great weight to certain classes of commodities. In a list 
published in Bradstreet’s on July 10, 1897, the per pound 
price of coke was $0.0007 while that of alcohol was $0.34 
and of wool $.50. It is obvious that much greater weight 
is given each of the latter commodities when prices per 
pound are totaled. Warren M. Persons has worked out 
the following table showing the constitution of Bradstreet’s 
index on September 1, 1921. 


TABLE 56

Distribution of Weights, Bradstreet’s Index on September 1, 1921

                                                       Weight
Commodity Group                               (Percentage Distribution)

Textiles, raw and manufactured:
    Cotton fabrics ..............................  16.7
    Wool, raw ...................................   5.6
    Other .......................................   3.3
Provisions and Groceries:
    Milk, eggs, butter and cheese ...............   8.6
    Cured pork and beef, and lard ...............   6.9
    Miscellaneous provisions ....................  12.4
Hides and leather ...............................  13.3
Building materials ..............................   1.5
Coal and Coke:
    Coal ........................................  [illegible]
    Coke ........................................  [illegible]
Metals:
    Iron and steel ..............................  [illegible]
    Other metals ................................   4.3
Chemicals and drugs .............................   9.6
Livestock .......................................   3.6
Oils ............................................   4.0
Fruits ..........................................   3.2
Breadstuffs .....................................   1.0
Naval stores ....................................    .9
Miscellaneous ...................................   4.2

* Less than one tenth of one per cent. 

This distribution of weights gives great influence to cotton 
fabrics, hides and leather, chemicals and drugs and cured 
meats, and the characteristics of this index number are due 
in large part to this weighting. The weights, moreover, 

1 Review of Economic Statistics, Prel. Vol. II, 365-366. 


are not constant, for changing prices cause changes in the 
distribution of weights. Thus on February 1, 1920, textiles 
had a weight of 34.46% of the total. In the course of the 
ensuing eighteen months textile prices fell much more 
rapidly than general prices, so that by September 1, 1921, 
textiles had a weight of only 25.58% of the total. 
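The unconscious weighting of a simple sum of per pound prices may be sketched in Python. The three prices are those quoted above from the list of July 10, 1897; they are only three of the articles in the index, so the shares computed here are shares of the three-article sum and not of the full index. The function name and the halving of the wool price are assumptions made for illustration.

```python
def implicit_weights(per_pound_prices):
    """Share of each article in a simple sum of per pound prices."""
    total = sum(per_pound_prices.values())
    return {name: price / total for name, price in per_pound_prices.items()}


# Per pound prices quoted in the text (dollars).
PRICES_1897 = {"coke": 0.0007, "alcohol": 0.34, "wool": 0.50}

before = implicit_weights(PRICES_1897)

# Hypothetical later date: the price of wool is halved, the others unchanged.
after = implicit_weights({**PRICES_1897, "wool": 0.25})

for name in PRICES_1897:
    print(f"{name:8s} before {before[name]:7.2%}   after {after[name]:7.2%}")
```

The article with the high price per pound dominates the sum, and a fall in its price reduces its own weight while raising the weights of the others, which is the shifting of weights described in the text.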

In spite of this illogical weighting system Bradstreet’s 
index has merit, and for certain purposes, particularly as 
a barometer of business conditions, is one of the most 
useful index numbers of wholesale prices. 


DUN’S INDEX OF WHOLESALE PRICES


The index number of wholesale prices published by the 
mercantile agency of R. G. Dun and Co., and appearing 
monthly in Dun’s Review, resembles Bradstreet’s index in 
that it is presented as a sum of actual prices, with no base 
period. The method employed in its construction, the 
distribution of weights and the periods covered by the 
two index numbers differ materially. 

Dun’s index is a statement, in dollars and cents, of the 
cost of a year’s supply, for a single individual, of certain 
staple commodities. The monthly index is secured by
multiplying the price of each of the staples included by the 
average annual per capita consumption of that commodity. 
Thus the index on January 1, 1921, was $198.600, which was 
the total cost, at prices prevailing on January 1, 1921, of 
a year’s supply, for a single individual, of all the commodities
included. The index is based upon prices prevailing on
the first day of each month. The prices of approximately 
300 commodities are utilized in its construction. No list 
of the actual commodities included has been published, nor 
has the exact number ever been stated. Quotations for 
only the necessities of life are utilized, luxuries being 
excluded. Though the weights are based upon per capita 
consumption, the index does not purport to be a measure of 
the cost of living. Staples such as pig iron, coal, building
materials, etc., which are not consumed directly by in- 
dividuals, are included. This index, then, is designed to 
serve as a measure of general changes in the level of whole- 
sale prices. It was first published in 1901, but the calcula- 
tions have been carried back to 1860. 

Logically, Dun’s method of weighting each price by the 
average per capita consumption of the given commodity 
is excellent, but doubt arises as to how effectively the 
method is employed in practice. Adequate data concerning 
the consumption of commodities in the United States are 
lacking, except for a relatively few articles. Moreover, 
it is known from the published figures that the weight given 
to food is approximately 50 per cent of the total, a figure 
which is believed to be somewhat excessive in the light of 
such pertinent facts as are available. This excessive 
weighting of food accounts for certain of the characteristic 
movements of Dun’s index, and explains its failure to accord 
closely at all times with the other general index numbers. 
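A sketch of the per capita method, and of the share of the total falling to food, is given below in Python. Since no list of the commodities or weights in Dun’s index has been published, every commodity, price and consumption figure here is hypothetical, as are the function names; the figures were chosen so that food accounts for one half of the total.

```python
from decimal import Decimal

# Hypothetical staples:
# name -> (price per unit, annual per capita consumption in units, is it a food?)
STAPLES = {
    "flour": (Decimal("0.05"), 200, True),
    "beef": (Decimal("0.20"), 150, True),
    "coal": (Decimal("5.00"), 4, False),
    "pig iron": (Decimal("20.00"), 1, False),
}


def cost_of_years_supply(staples):
    """Dollar cost of one individual's annual supply at the given prices."""
    return sum(price * qty for price, qty, _ in staples.values())


def food_share(staples):
    """Fraction of the total cost accounted for by articles of food."""
    food = sum(price * qty for price, qty, is_food in staples.values() if is_food)
    return food / cost_of_years_supply(staples)


print(cost_of_years_supply(STAPLES))
print(food_share(STAPLES))
```

With one half of the total resting on food, a movement confined to food prices changes such an index by half as much again as it would if food carried a weight of one third.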


FISHER’S WEEKLY INDEX NUMBER OF
WHOLESALE PRICES


In January, 1923, Irving Fisher launched a new index 
number of wholesale prices, computed and published
weekly. This is the first comprehensive price index to 
appear on a weekly basis. 

The index numbers are published as relative prices, on 
the 1913 base, but the relatives are derived from weighted 
aggregates of actual prices. The weights, which are con- 
stant, are the quantities marketed in 1919, as recorded by 
the Census Bureau. The formula employed is, therefore, 
the first aggregative described above 


$$\frac{\sum p_1 q_0}{\sum p_0 q_0}$$


except that weights are drawn from 1919 instead of the base 
year. The procedure is substantially that followed by the
United States Bureau of Labor Statistics. Fisher considers 
this to be the closest approach to the “ideal” index which
the data permit. 

In the matter of weighting this index number marks an 
interesting innovation. The index is based upon prices of 
205 commodities, drawn from Dun’s Review. For the 
most important commodities the actual quantities for 
the year 1919 are used as weights, with one slight correc- 
tion to provide an appropriate class weight. To facilitate 
computation the weight given to each of the other com- 
modities is one of the following numbers: 1000, 100, 10, 1, 
.1. In each case the round number nearest (geometrically) 
to the actual quantity is used as weight. This procedure 
was adopted after it had been established that the use of 
round weights for the less important commodities involved 
an error of less than one per cent. 
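The choice of the round number nearest geometrically to the actual quantity may be sketched in Python. Nearness is here measured on the logarithmic scale, which is the assumption made in reading “geometrically”; the function name and the sample quantities are hypothetical, and the unit in which quantities are expressed is not stated in the text.

```python
import math

# The round weights named in the text.
ROUND_WEIGHTS = (1000, 100, 10, 1, 0.1)


def round_weight(quantity):
    """Round weight whose logarithm lies nearest to that of the actual quantity."""
    return min(ROUND_WEIGHTS, key=lambda w: abs(math.log10(quantity) - math.log10(w)))


# Hypothetical quantities. The dividing point between 10 and 100 is their
# geometric mean (about 31.6), not their arithmetic mean (55).
for quantity in (5000, 40, 25, 0.2):
    print(quantity, "->", round_weight(quantity))
```

A quantity of 40 is thus given the weight 100, although it is arithmetically nearer to 10, because the ratio 100 to 40 is smaller than the ratio 40 to 10.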

The base aggregate of Fisher’s index in the year 1923 was 
that of November, 1922, but the weekly index has been 
spliced on to the index of the Bureau of Labor Statistics. 
For November, 1922, the index of the Bureau of Labor Sta- 
tistics was 156. In computing the 1923 indices the ratio of 
the aggregate for a given week to the base aggregate (Novem- 
ber, 1922) was multiplied by 156. The present base aggre- 
gate is that of November, 1923, but a similar splicing pro- 
cess permits the index to appear as a relative, with 1913 as 
base. 
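The splicing operation is a single multiplication, sketched here in Python. The value 156 for November, 1922, is that given above; the weekly and base aggregates are hypothetical figures, and the function name is an assumption.

```python
def spliced_index(week_aggregate, base_aggregate, link_value=156.0):
    """Weekly relative on the 1913 base.

    The ratio of the week's aggregate to the base aggregate (November, 1922)
    is multiplied by the Bureau of Labor Statistics index for the base month.
    """
    return week_aggregate / base_aggregate * link_value


# Hypothetical aggregates: the week's aggregate stands 5 per cent above the base.
week_index = spliced_index(week_aggregate=1050.0, base_aggregate=1000.0)
print(round(week_index, 1))
```

A week in which the aggregate equals the base aggregate is published as 156, so that the weekly series continues the monthly series of the Bureau without a break in level.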

The index for each week is published in the daily press on 
Monday of the following week. This promptness of publication
gives an added value to the index, which is one of the
most useful general price index numbers currently compiled. 


PERSONS’ COMMODITY PRICE INDEX
OF BUSINESS CYCLES


The index numbers which have been described above 
have all been designed to serve as measures of changes in
the general level of prices, though their real significance is 
limited to the field of wholesale commodity prices. Several 
price index numbers have been constructed for special 
purposes, and it is probable that the future will see an 
increasing number of such restricted purpose indices. Of 
these, one of the most interesting is the “commodity price
index of business cycles” constructed by Warren M. Persons
and Eunice S. Coyle. 

The purpose for which this index has been constructed is 
“to measure changes in general business conditions during
alternating periods of prosperity and depression.”¹ Otherwise
stated, the problem is “to select and combine series
of wholesale prices of commodities in order to secure an
index of business cycles.” The business cycle is primarily
a price phenomenon. Other factors are of great importance, 
but all other forces are felt through their influence upon 
the price system, and it is through price changes that the 
cycle is brought home most immediately to business men, 
employees and consumers. While all elements in the price 
system are affected, security prices and wholesale com- 
modity prices feel the influence of cyclical changes most 
directly and violently. Professor Persons has chosen to 
construct a wholesale price index which will reflect and 
measure such price movements as are directly related to the 
business cycle. His problem, then, is fundamentally different
from that of makers of “general purpose” index numbers,
and his methods are correspondingly different.

In determining the constitution of this index tests were 
applied to a number of price indices for industrial groups 
and to individual price quotations for a number of commodi- 
ties, the object being the selection of those price series the 
movements of which corresponded with “typical cyclical
fluctuations” of prices and general business during the
period 1903-1914. Earlier investigations had shown that 
the cyclical movements of the price index numbers of the 


1 Cf. Review of Economic Statistics, Prel. Vol. III, 353-369. 


Bureau of Labor Statistics and of Bradstreet’s, as well as 
many non-price series, showed a general agreement during 
this period, crests of the wave movements coming in 1907, 
1910 and 1912. The wave movement which was common 
to these various series was taken as the standard, and was 
employed in testing various price series in order that those 
with non-typical fluctuations might be eliminated. 

Index numbers of prices for various industrial groups were 
first constructed, and these were tested in the manner 
described. It was established that price index numbers for 
mining, and for manufactures of iron and steel, leather, 
chemicals and textiles showed typical cyclical fluctuations; 
their movements corresponded with the cyclical swings of 
business. The curves of the index numbers for agriculture, 
industries manufacturing food products, and the stone, 
clay and glass manufacturing group showed non-typical 
fluctuations; price movements in these series did not 
correspond with the swings of general prices and of general 
business conditions. 

The investigation was carried further, price movements 
of individual commodities being tested in the same way. 
All those price series which showed non-typical fluctua- 
tions, as well as those which were inflexible, were excluded 
as being unsuited to the purpose in mind. It was found 
that several series within non-typical industrial groups 
showed typical fluctuations; moreover, the prices of in- 
dividual commodities were found to be more flexible and 
sensitive than group index numbers. For these reasons it 
was decided that the final price index of business cycles 
should be constructed from individual price series and not 
from group index numbers. 

The commodities finally selected for inclusion were only 
ten in number. They included cotton-seed oil, coke, pig 
zinc, pig iron, bar iron, mess pork, hides, print cloths, 
sheetings and worsted yarns. The prices of these commodities
are all highly flexible and sensitive to business
changes. The industries they represent, moreover, are 
important industrially. 

When price quotations for these ten commodities had 
been secured extending back to 1890, the desired “price
index of business cycles” was constructed. Prices for each
month and year were expressed as relatives on the 1890-99 
base. (The base has since been changed to 1919.) The 
index is an unweighted geometric mean of these relative 
prices. 
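An unweighted geometric mean of relative prices may be computed as in the following Python sketch. The two commodities and their prices are hypothetical, chosen so that one price doubles and the other is halved; the function names are assumptions, and the arithmetic mean is included only for contrast.

```python
import math


def relatives(prices, base_prices):
    """Relative prices, the base price of each commodity representing 100."""
    return [100.0 * prices[name] / base_prices[name] for name in base_prices]


def geometric_mean_index(prices, base_prices):
    """Unweighted geometric mean of the relative prices."""
    rel = relatives(prices, base_prices)
    return math.exp(sum(math.log(r) for r in rel) / len(rel))


def arithmetic_mean_index(prices, base_prices):
    """Unweighted arithmetic mean of the relative prices, for contrast."""
    rel = relatives(prices, base_prices)
    return sum(rel) / len(rel)


# Hypothetical prices: one commodity doubles in price, the other is halved.
BASE = {"pig iron": 1.0, "hides": 1.0}
NOW = {"pig iron": 2.0, "hides": 0.5}

print(round(geometric_mean_index(NOW, BASE), 6))
print(arithmetic_mean_index(NOW, BASE))
```

The geometric mean treats a doubling and a halving as offsetting movements, whereas the arithmetic mean of the same relatives shows a rise.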


OTHER WHOLESALE PRICE INDEX NUMBERS


Somewhat similar in character, though not constructed 
with the avowed purpose of serving as an index of business 
cycles, is the wholesale price index of 20 basic commodities 
compiled by the Federal Reserve Bank of New York. It 
was constructed to serve as a sensitive index of price changes 
in the basic commodity markets. As published in the 
Monthly Review of the Federal Reserve Bank of New York 
this index appears as a relative, with 1913 as the base. It 
has been computed only for the period 1913 to date. 

An index of wholesale food prices is published weekly 
by the Annalist. It is based upon price quotations of 25 
basic articles of food. These are reduced to relatives on 
the base 1890-99, and combined in the form of an un- 
weighted arithmetic mean. The index purports to serve 
only as an index of changes in the cost, at wholesale, of the 
food called for by a theoretical family food budget. 

Another index of wholesale food prices is published by 
Thomas Gibson of New York, and appears in his weekly 
market letter. It is based upon the prices of 22 primary 
articles of food. No full account of the weighting system 
has been given, though it is stated that Dun’s general method 
has been employed. The index, as published, is based on 
weighted relative prices, the base of each being the average 
price during the period 1890-99. The result is modified, 
however, through multiplication by a constant factor, a
process originally designed to bring this index into harmony 
with Dun’s, of which it was to serve as a continuation. 


DIFFERENCES BETWEEN INDEX NUMBERS 


If the index numbers of wholesale prices which have been 
described are compared over a period of years¹ there is
found to be a rather surprising agreement as to the general 
direction of price movements, but certain significant differ- 
ences in details appear. Thus one index number may show 
an increase in the price level from one month to the next, 
while another index may show no change, or even a decline. 
Such differences are accentuated in periods of violent price 
fluctuations, such as those which occurred from 1916 to 
1921. The reasons for such differences have been suggested 
in the discussion of methods of making index numbers. 
In every case such contradictory movements may be traced 
to one or more of the following: 


a. Differences in the number of commodities. 
b. Differences in the kinds of commodities included. 
c. Differences in the method of securing price quotations (e.g.,
contract or open market; first of month or average of
monthly prices).
d. Differences in weights. 
e. Differences in methods of averaging. 


SUMMARY: INDEX NUMBERS OF WHOLESALE PRICES


The following summary of the main characteristics and
uses of the different index numbers is based upon a con- 
sideration of these differences. 


1. Of indices currently published the best general purpose 
index number of wholesale prices in the United States is that of 
the United States Bureau of Labor Statistics. It is a measure, 


1 Such a comparison of the three chief American index numbers of wholesale 
prices is made by W. C. Mitchell in “The Making and Using of Index Numbers,” 
Part I, Bulletin 284, U.S. Bureau of Labor Statistics, 94-112. 


however, not of the average ratio of change in prices, but of 
variations in the money cost of a constant quantity of goods. 

2. For the period 1913-1918 the index of the War Industries 
Board is the best general index of wholesale prices. 

3. Dun’s index is a good measure of wholesale price changes, 
but its value is lessened somewhat by the undue weight given to 
food products and by the failure of its makers to announce the 
commodities included and the exact weights employed. 

4. Bradstreet’s index is a good measure of wholesale price 
changes, but the weight given to raw materials detracts from its 
representative value. The absence of logical weights is another 
weakness of some importance. Because of the weight given to 
raw materials this index serves as a better business barometer 
than either of the other two major indices. 

5. Persons’ index is not designed to be a general measure of 
price changes. It is well adapted to perform the function for 
which it is constructed, that of serving as an index of business 
cycles. 

6. The index of 20 commodities constructed by the Federal 
Reserve Bank of New York serves as a useful measure of changes in 
prices of basic materials. 

7. The indices of the Federal Reserve Board, constructed for 
the purpose of international comparison, perform a highly useful 
function in facilitating comparison of price movements in different 
countries. The separate index numbers constitute good measures 
of general price movements in the various countries for which 
they are constructed. 

8. Fisher’s weekly index of wholesale commodity prices serves 
a very useful purpose as a comprehensive and accurate price 
index. Its utility is enhanced by its prompt publication. 

9. The Annalist index of wholesale food prices is of limited 
value. The method of construction (securing an unweighted 
average of relative prices) is likely to give results of doubtful 
value, and the base period (1890-99) is too far removed. 


OTHER PRICE INDEX NUMBERS


The measurement of price changes by the use of index 
numbers has not been confined to wholesale prices. Many
variations of this device have been utilized in measuring 
price movements in other fields. It will be useful at this 
point briefly to indicate the character of some of these 
variations. 


INDEX NUMBERS OF RETAIL PRICES


An index of retail food prices is published currently by the 
United States Bureau of Labor Statistics. The general 
methods employed are similar to those already explained 
in connection with the index of wholesale prices computed 
by that agency, with such differences as inevitably result 
from the nature of the material. 

On the 15th day of each month actual retail selling prices 
of 43 articles of food¹ are secured from dealers in 51 representative
cities throughout the United States. Though
there are variations in local practice comparable quotations 
are secured, in so far as this is possible. The total of the 
money prices for each article is taken, and the average for 
the United States secured by dividing this sum by the num- 
ber of dealers reporting prices on that article. A weighted 
aggregate of these individual prices is then secured. Each 
article is weighted in accordance with its relative importance 
in the consumption of the average family in 1918, as de- 
termined by various budget studies conducted by the 
Bureau. The method of securing weighted aggregates of 
actual prices resembles that employed in the construction 
of the wholesale price index. The retail price aggregates 
are also turned into relatives on the 1913 base, and these
relatives are currently published in the Monthly Labor
Review. 
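
The steps just described (averaging the dealers' quotations for each article, forming a weighted aggregate, and reducing the aggregate to a relative on a base period) may be sketched as follows. This is a minimal illustration only: the function names, the two articles, and every price and weight are hypothetical, and are not the Bureau's figures.

```python
from statistics import mean

def national_average_prices(quotes):
    """Average each article's quoted prices over all reporting dealers."""
    return {article: mean(prices) for article, prices in quotes.items()}

def weighted_aggregate(avg_prices, weights):
    """Sum of average price times consumption weight, over all articles."""
    return sum(avg_prices[article] * weights[article] for article in weights)

def price_relative(current_aggregate, base_aggregate):
    """Express the current aggregate as a relative, with the base = 100."""
    return 100.0 * current_aggregate / base_aggregate

# Hypothetical weights (quantities consumed by the average family)
weights = {"flour": 10.0, "butter": 2.0}

# Hypothetical dealer quotations, in dollars per pound
base_quotes = {"flour": [0.03, 0.05], "butter": [0.38, 0.42]}
current_quotes = {"flour": [0.05, 0.07], "butter": [0.55, 0.65]}

base = weighted_aggregate(national_average_prices(base_quotes), weights)
current = weighted_aggregate(national_average_prices(current_quotes), weights)
index = price_relative(current, base)
print(round(index, 1))
```

Because the weights are fixed quantities, the relative reflects price changes alone; the choice of weights is the point at which budget studies enter.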

Similar operations are carried through for the 51 indi-
vidual cities, and an index number of retail food prices is
published for each of these cities. The actual commodity
prices, averaged for the United States and for each of the
cities, are published in occasional retail price bulletins, a
practice which adds materially to the value of the work
done by the Bureau of Labor Statistics.


¹ The index has been based upon 43 articles of food since January, 1921; prior
to that date the index was based upon 22 articles of food.

The difficulties inherent in the problem of measuring 
wholesale price movements have been discussed at some 
length. The construction of index numbers of retail prices 
of the type just described presents even greater difficulties. 
All the theoretical problems arising in the former case are 
to be solved and, in addition, the practical difficulties of 
securing suitable weights, accurate price figures and com- 
parable quotations are intensified. Because of the lack of 
commodity standardization, and because of variations in 
business practice and local customs, the latter difficulty 
is particularly acute. For these reasons no index of retail 
prices at present published can be accepted with the con- 
fidence with which the best indices of wholesale prices 
may be received. 


INDEX NUMBERS OF THE COST OF LIVING


If these problems are acute in constructing an index of 
retail prices they are doubly hard to solve in measuring 
such an entity as the cost of living. When food prices, 
rents, retail clothing prices, cost of fuel and light, retail 
furniture prices and the prices of the other miscellaneous 
items which are included in the budget of the average family 
are to be averaged, and an index number constructed to 
measure variations in the cost of these items, numerous 
statistical difficulties must be overcome. Theoretical 
questions concerning the most suitable methods of averag- 
ing and weighting present themselves, but more important 
are the practical problems involved in the collection of 
accurate and comprehensive data. Index numbers of the 
cost of living, therefore, must be used with particular cau- 
tion until their reliability has been more effectively demon- 
strated than has been the case to date. 




Two index numbers of the cost of living are currently 
compiled, one by the Bureau of Labor Statistics, one by 
the National Industrial Conference Board, an organiza- 
tion of employers’ associations. The former appears in the 
Monthly Labor Review, the latter in publications of the 
National Industrial Conference Board. In each case the 
chief items of domestic expenditure are weighted in accord- 
ance with their relative importance in household budgets, 
and the combined result expressed as a relative, with 1913 
values as the base. (July, 1914, is the base for the Indus- 
trial Conference Board index.) 


INDEX NUMBERS OF PRICE AND BUYING
POWER OF FARM PRODUCTS


A set of very useful index numbers is compiled by the 
United States Department of Agriculture. One of these 
is designed to measure, by months, changes in the prices 
received at the farm for the chief agricultural crops. An- 
other measures changes in the farm prices of live-stock 
products. The two are combined into a third index measur- 
ing changes in the average price of all farm products. All 
these index numbers are published as relatives, with 1913 
as base. 

The significance of changes in the index of prices of farm 
products is made clearer by the computation of a fourth 
index of purchasing power of farm products. The index of 
money prices means little by itself, since the purchasing
power of money is constantly changing. Accordingly, it 
must be interpreted in terms of an index which measures 
changes in the prices of the things the farmer buys with 
his money. Such an index is secured by re-compiling the 
index of wholesale prices of the Bureau of Labor Statistics, 
excluding farm products and foods. The computation of 
the index of purchasing power of farm products may be 
illustrated, with reference to the figures for January, 1923.
For that month the index of prices of farm products was
116 (1913 = 100). The index of commodity prices at 
wholesale, excluding farm products and foods, was 170. 
That is, the farmer was receiving 16 per cent more for his 
products than in 1913, but general commodities at whole- 
sale cost 70 per cent more than in 1913. Therefore his 
purchasing power, with reference to 1913 as base, was equal
to 116/170, or 68 per cent. These index numbers, by months for
the year 1923, are given in the following table.


TABLE 57
Commodity Price Indices
Index Numbers of Price and Buying Power of Farm Products, 1923


(1913 = 100)

                   Prices at the Farm
            ---------------------------------
            Crops,    Live Stock,  Crops and    Wholesale      Purchasing
Month       15th of   15th of      Live Stock   Price of       Power of
            Month     Month        Combined     Commodities *  Farm Products †

January       126       106          116           170             68
February      130       107          118           172             69
March         134       106          120           175             69
April         139       107          123           176             70
May           140       105          123           172             71
June          139       100          120           168             71
July          136       102          119           165             72
August        136       102          119           163             73
September     138       109          123           164             75
October       139       103          121           161             75
November      137        97          117           160             73
December      137        94          116           158             73


* Excluding farm products and foods. † Expressed in terms of other products.
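
The computation of the purchasing-power column may be sketched as below. The function name is an assumption made for illustration; the figures are those given in Table 57 for the first four months of 1923, and rounding to the nearest whole number is assumed, since it agrees with the published values.

```python
def purchasing_power(farm_price_index, wholesale_index):
    """Farm price index expressed in terms of other commodities (1913 = 100)."""
    return round(100 * farm_price_index / wholesale_index)

months = ["January", "February", "March", "April"]
farm_prices = [116, 118, 120, 123]   # crops and live stock combined
wholesale = [170, 172, 175, 176]     # excluding farm products and foods

power = [purchasing_power(f, w) for f, w in zip(farm_prices, wholesale)]
for month, value in zip(months, power):
    print(month, value)
```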


It should be clear that one important assumption is made 
in the computation of the above index of purchasing power. 
This is, that changes in the prices of the things the farmer 
buys are measured by the corrected index of wholesale
commodity prices of the Bureau of Labor Statistics. This
may be questioned, for the farmer does not generally buy 
at wholesale, nor does he buy just the things included in that 
index. It is used, of course, in default of a more appropriate 
index, but the possible errors it introduces should be recog- 
nized. 

These index numbers have been compiled for the period 
from 1913 to date, by months. They are published in 
Crops and Markets, issued by the Department of Agriculture. 


INDEX NUMBERS OF MONEY WAGES
AND REAL WAGES


In years when industrial troubles are numerous particular 
interest attaches to the course of wages. And though the 
interest is intensified at these times, the problem is one of 
permanent importance. Many index numbers of wages 
have been prepared by governmental agencies, employers’ 
associations, labor unions, and private investigators. No 
attempt to evaluate these various indices will be made at 
the present time; for our present purpose it is sufficient that 
a few points of importance concerning the construction 
and interpretation of such index numbers be stressed. 

The problem involved in the construction of an index of 
money wages is essentially one of averaging prices, the 
prices in this case being those of labor instead of commodi- 
ties. Difficulties at once arise because of the various ways 
in which the “prices” of labor are quoted. Shall the in-
dex of wage movements be based upon hourly rates of pay,
full time weekly earnings (i.e., the amount which would be 
earned at prevailing hourly rates if a full time week were 
put in), or actual weekly earnings? An index based upon 
hourly rates would take no account of changes in the hours 
of labor, unemployment, under-employment, overtime, or 
special bonuses. Changes in the hours of labor would be 
reflected in an index of the second type, but the other
objections would still hold. An index of actual weekly
earnings, if it covered seasons of slack as well as full work, 
would appear to be the most useful type. But data for this 
sort of index are exceptionally hard to obtain, a fact of 
considerable practical importance. Such an index is con- 
structed by the New York State Department of Labor, 
figures being secured from representative factories through- 
out the State. This index is given as an actual average in 
dollars and cents, though these figures may be reduced to 
relatives on any base chosen. 

Apart from the type of wage figure which shall be used, 
there are questions as to methods of weighting and averag- 
ing which need not be discussed here. Probably of greater 
importance is the matter of “representativeness,” corre-
sponding to the problem as to the kinds of commodities to be
included in a wholesale price index. If wages of skilled 
workmen alone are averaged the result cannot be presented 
as an index of general wage rates. Again, if the wages of 
foremen and highly paid technical workers are included in 
a general wage index they should be given only the weight 
they deserve and the nature of the resulting index should 
be clearly explained. In fact, in this case, as with wholesale 
prices, it is highly desirable that separate index numbers 
be constructed for the different classes of wage-earners, 
these being later combined into a properly weighted average 
of all wages. Our knowledge of the nature of wage move- 
ments, as of price movements, is more likely to be extended 
by breaking up the materials to be averaged into significant 
and homogeneous groups and studying these groups in 
detail, than by perfecting a single index of a great number 
of heterogeneous classes. 

Differences in practice in regard to some of the points 
suggested above have led to numerous and important 
differences between published index numbers of wages. 
Unless the methods employed in the construction of such 
index numbers are fully explained and the nature of the
material entering into them completely described, the
story of wage movements which they tell may only be 
accepted with reservations. 

An index of money wages would be significant in itself 
only if the cost of living remained unchanged. But when 
prices and the cost of living are rising, the gains registered 
by a rising index of wages may be purely illusory; the same 
is true of losses under opposite conditions. If a measure 
of relative well-being is desired it is necessary to convert 
the index of money wages into an index of real wages, which 
shall take account of changes in the cost of living. 

The example which follows will serve to illustrate the 
process of constructing an index of real wages. The index 
of wages is based upon average weekly earnings of ap- 
proximately 475,000 workers in New York State, earnings 
for December, 1914, being taken as the base. The cost of 
living index is that of the United States Bureau of Labor 
Statistics for thirty-two cities throughout the country, the 
figure for December, 1914, being taken as 100.¹


TABLE 58


Index Numbers of Real Wages in New York State, 1914-1923
(December, 1914 = 100.)


                   Relative weekly                 Real Wages
Date               earnings of New   Cost of       (i.e., purchasing power
                   York workers      living        of money wages)

December, 1914          100            100              100
December, 1915          107            102              105
December, 1916          123            115              107
December, 1917          140            138              101
December, 1918          185            169              109
December, 1919          209            193              108
December, 1920          225            195              115
December, 1921          198            169              117
December, 1922          210            165              127
December, 1923          222            168              132


¹ It would be more logical in the present instance to correct the New York wage
figures by a cost of living index for New York cities alone. The difference in the 
results would not be material, however. 


INDEX NUMBERS OF PRICES 251 


In securing the index of real wages, the index of relative 
earnings is divided by the cost of living index for the cor- 
responding date. It is clear that the earnings figure alone 
is quite inaccurate as a measure of the condition of the 
workers. When corrected it shows an increase in real wages, 
but one much less pronounced than the gain in money 
wages. Such figures at their best, however, may be ac- 
cepted as only approximations to the truth. There is a 
considerable margin of error in the construction of both 
sets of index numbers from which the measure of real wages 
is computed, so that the latter should only be looked upon 
as a rough index of changes in the real income of New York 
factory workers. 
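
The division described above may be sketched as follows, using the earnings and cost of living figures of Table 58. The function name is hypothetical, and rounding to the nearest whole number is an assumption that happens to agree with the published column.

```python
def real_wage_index(earnings_index, cost_of_living_index):
    """Money-wage index corrected for changes in the cost of living."""
    return round(100 * earnings_index / cost_of_living_index)

years = list(range(1914, 1924))   # December of each year
earnings = [100, 107, 123, 140, 185, 209, 225, 198, 210, 222]
cost_of_living = [100, 102, 115, 138, 169, 193, 195, 169, 165, 168]

real_wages = [real_wage_index(e, c) for e, c in zip(earnings, cost_of_living)]
for year, value in zip(years, real_wages):
    print(year, value)
```

Between December, 1914, and December, 1923, money earnings rose 122 per cent, while the corrected figure rose only 32 per cent.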


REFERENCES

Bowley, A. L. Elements of Statistics (196-213).

Chaddock, R. E. Principles and Methods of Statistics (Chap. X).

Fisher, Irving. Revision of the Weekly Index Number. Jour-
nal of the American Statistical Association, Sept., 1924.

Fisher, Irving. The Making of Index Numbers.

Flux, A. W. The Measurement of Price Changes. Journal of
the Royal Statistical Society, March, 1921 (167-215).

Kelley, Truman L. Statistical Method (331-347).

Mitchell, W. C. The Making and Using of Index Numbers.
Part I, Bulletin 284, U. S. Bureau of Labor Statistics.
Oct., 1921.

Persons, Warren M., and Coyle, Eunice. A Commodity Price
Index of Business Cycles. Review of Economic Statistics.
Prel. Vol. III (353-369).

Persons, Warren M. Fisher’s Formula for Index Numbers.
Review of Economic Statistics. Prel. Vol. III (103-113).

Walsh, C. M. The Measurement of General Exchange Value.

Walsh, C. M. The Problem of Estimation.

Young, Allyn A. The Measurement of Changes in the General
Price Level. Quarterly Journal of Economics, Aug., 1921
(557-573).

Young, Allyn A. Index Numbers (in Rietz, H. L., Handbook
of Mathematical Statistics, 181-194).


CHAPTER VII 


THE ANALYSIS OF TIME SERIES: MEASUREMENT OF TREND


The preceding sections have dealt primarily with fre- 
quency series and with problems arising in the attempt to 
organize and describe such series. We are now concerned 
with data in the study of which the essential problem is 
the analysis of chronological variations. Such series are of 
major importance in the field of economic statistics, for most 
of the data of economics and business are variables in time,
such as bank clearings, steel production, volume of sales, etc.
This dominating importance of series in time is not found 
in any other field of statistical research, and the develop- 
ment of methods of analysis appropriate to time series has 
come, accordingly, only within recent years with the wider 
adoption of statistical methods in the field of economics. 

Problems connected with time series arise both in the 
ordinary routine of internal administration and in the 
analysis of general economic conditions. Sales, purchases, 
profits on the one hand, stock prices, interest rates, business
failures on the other, are variables which fluctuate with the
passage of time. In the analysis of such series it is generally
desired that the rate and character of growth be determined, 
and that periodic and accidental fluctuations be isolated 
for study. The sales manager wishes to know how the 
volume of sales is faring, when and why it fluctuates and 
how it compares with volume of production. The economist 
desires to know the trend of prices, and to scrutinize mi- 
nutely the upward and downward movements of the price 
level. The making of business plans on even a small scale,
as well as the most elaborate schemes of economic forecast-
ing, must rest upon such study of past trends and fluctua-
tions, and upon comparison of the movements of related
series in time. Scientific study of the business cycle is only 
possible through the application of such methods. Our 
present task is the development of methods appropriate to 
the analysis of series in time. 


THE PRELIMINARY ORGANIZATION OF TIME SERIES


The data of time series usually require less preliminary 
organization than statistical data which are to be reduced 
to the form of a frequency distribution. The source, 
primary or secondary, from which the figures are taken 
usually presents them in shape for analysis. Certain pre- 
cautions should be observed, however. 

The dates to which the figures apply should be clearly 
understood and definitely stated. Monthly data may be 
as of the first of each month (as in the case of Bradstreet’s 
index number of wholesale prices), averages for each month 
(as in the case of the Bureau of Labor Statistics’ price 
index), or totals for each month (as in the case of figures 
for cotton consumption). They may be cumulative monthly 
figures, each item representing the total for the year to date, 
as in the case of certain coal production data. If average 
figures are given for a month or year it is important to know 
how the average has been secured. 
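
Where figures are published in cumulative form, the totals for the individual months are recovered by successive subtraction. A minimal sketch follows; the function name and the tonnage figures are hypothetical and are not drawn from any published series.

```python
def monthly_from_cumulative(cumulative):
    """Convert year-to-date totals into totals for the separate months."""
    if not cumulative:
        return []
    # The first month is its own total; each later month is the increase
    # of the year-to-date figure over that of the preceding month.
    later = [b - a for a, b in zip(cumulative, cumulative[1:])]
    return [cumulative[0]] + later

# Hypothetical year-to-date production, January through April
year_to_date = [40, 85, 120, 170]
print(monthly_from_cumulative(year_to_date))
```

The sketch assumes that the cumulation begins afresh each January; a series running across a year end would need to be split by year before subtraction.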

Again, it is essential that in any time series there be 
strict comparability between data for different dates. Any 
attempt to analyze a series which is not homogeneous must 
be misleading and futile. Yet such series are not infre- 
quently published. Commodity production or consump- 
tion figures published by trade associations and by govern- 
mental agencies are often based upon returns from a varying 
number of reporting concerns. A series of price quotations 
may lack comparability as between different dates because
of slight changes in the unit or grade to which the quotation
applies, or because quotations are drawn from different 
markets. Changes in census classifications may result in 
lack of comparability in census data. Merchandise import 
and export figures are reported to have been compiled, 
until recently, in ways which render misleading comparison 
between the figures for different periods. A change in a 
salesman’s territory may alter his returns materially. It 
is stated that the character of the obligations represented 
by the United States Steel Corporation’s figures for “un- 
filled tonnage” varies from time to time. Thus in July, 
1922, it was reported, the figures meant actual orders for 
shipment at mill convenience, while a year earlier much of 
the business on its books was in contract form, with date of 
shipment not specified. These are examples of faults which 
may be found in time series, rendering analysis futile. 
Strict testing is essential before a series be accepted as 
accurate and homogeneous. 


GRAPHIC REPRESENTATION OF TIME SERIES 


Normally the first step to be taken in visualizing a series 
in time and in preparing for further analysis consists of 
plotting the data. The trend and general characteristics 
of a series may be most readily apprehended through 
graphic representation. The data may be plotted on 
ordinary arithmetic paper, on semi-logarithmic paper or, 
more rarely, on double logarithmic paper. The advantages
of the latter types for certain purposes have already been 
explained. The choice in a given case will depend upon the 
nature of the data and the object of the study. If interest 
lies in the absolute amount of fluctuations in sales, prices, 
pig iron production or whatever may be in process of analy- 
sis, or in the comparison of absolute differences between 
series, the ordinary rectilinear chart is to be employed. If 
percentage variations and the comparison of relative
fluctuations are matters of interest the semi-logarithmic
representation is preferable. In general, if one is accustomed 
to the interpretation of this latter type of chart, its use is 
advisable. A clearer, less-distorted presentation of rela- 
tions and a more significant comparison of series are 
generally secured when economic data having time as one 
variable are plotted on the semi-logarithmic paper. 
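
The distinction between the two rulings may be illustrated numerically. In the sketch below the series is hypothetical: it doubles at each date, so that the absolute increments grow while the percentage change is constant. On an arithmetic scale the vertical steps are the differences themselves; on a logarithmic scale they are the differences of the logarithms.

```python
import math

series = [100, 200, 400, 800]   # hypothetical values, doubling at each date

# Vertical distances between successive points on an arithmetic scale
arithmetic_steps = [b - a for a, b in zip(series, series[1:])]

# Vertical distances between the same points on a logarithmic scale
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(series, series[1:])]

print(arithmetic_steps)
print([round(step, 4) for step in log_steps])
```

The arithmetic steps differ widely, while the logarithmic steps are equal: equal percentage changes appear as equal vertical distances on the semi-logarithmic chart.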

For some purposes the process of studying series in time 
will have been completed when the data are thus plotted. 
The general trend may be roughly determined from the 
chart. The existence of seasonal and other periodic varia- 
tions may be ascertained. Rough comparisons of trends 
and fluctuations may be made. All the knowledge thus 
secured, it should be noted, will be non-quantitative in 
character, and the comparisons will be tentative and 
approximate. Even so, such charts enable trends and 
relations to be much more clearly visualized than do the 
raw figures, and for some purposes the knowledge thus 
secured is sufficient, though it lacks precision and accuracy. 
For other purposes more exact measurement and more 
refined analysis are required. Certain appropriate methods 
may be described. 


Forces AFFECTING SERIES IN TIME 


The general object in the analysis of a time series is the 
isolation of the effects of one or more of the forces affecting 
the given series. This may be desired in order that the 
past behavior of the single series may be understood, in 
order that the future behavior of the series may be pre- 
dicted, or in order that two or more series may be compared. 
It is not in any case possible to isolate these effects of 
individual forces with absolute accuracy, and in some cases 
it is impossible even to approximate such a result. But, in 
general, given figures covering a sufficiently long period, 
the effects of various influences upon the behavior of a given
series may be measured with some degree of accuracy.
The possibilities of using such measurements in economic 
analysis have by no means been fully realized. 

What are these forces which affect series of data in time? 
The forces in any given case may be unique, affecting only 
the given series, but in general the various influences acting 
upon such series may be placed in a limited number of 
categories. 


SECULAR TREND 


In the first place, most series of economic statistics 
exhibit a definite trend, which may be constant in direction, 
changing direction at a constant rate, or even characterized 
by abrupt changes in direction or rate which are due to the 
introduction of novel elements. Thus the volume of 
production or sales of a business house, taken over a period 
of years, will usually show a fairly regular growth. The
same is true of population, the production of basic minerals, 
the number of motor vehicles registered, etc. In some 
cases the rate of growth may be a negative one, as in the 
case of interest rates in the United States over the last half 
century. The concept of secular trend (i.e., trend over a long 
period of time) covers both positive and negative changes 
of this type. 

In the analysis of a time series the trend value at 
any date is taken to be the normal value at that date. 
That is, it is viewed as the value which would be recorded 
for the given series if the effects of all accidental and com-
plicating forces could be eliminated, leaving only the effect
of normal growth. This conception of a normal value for 
any series at a given date, a value which may be used as 
a base or point of reference in judging the effects of all 
forces other than the growth factor, is fundamental in 
economic analysis. “No other method,” says Carl Snyder, 
“enables us so quickly to set economic events in their just 
perspective.” 


MEASUREMENT OF TREND 257 


The fact should be emphasized that by secular trend is 
meant the smooth, regular, long-term movement of a statis- 
tical series. Frequent and sudden changes either in absolute 
amounts or in rates of increase or decrease are quite incon- 
sistent with the concept of secular trend. It is true that 
there may be occasional changes due to the interjection of 
a new element or the withdrawal of an old factor. But 
the breaking up into numerous sections of the period covered 
by a time series, and the determination of trend for each 
of these minor periods, does violence to the whole concept 
of gradual change over a period of time. If the factors 
affecting the series are subject to violent and frequent altera- 
tion, no secular trend can be assumed. 

It does not follow from this discussion that a definite 
upward or downward trend exists for all time series. Many 
series, such as barometric readings at a certain point, 
merely fluctuate about a constant level which does not 
change with the passage of time. 


PERIODIC FLUCTUATIONS


If the plotted representation of a time series be studied, 
the long-term trend may be discerned in the general upward 
or downward drift, but may not be precisely determined 
by inspection because of the existence of numerous fluctua-
tions, superimposed upon the trend. These fluctuations
may be regular or irregular, violent or mild, simple or 
complex. The value of the variable at any given date 
represents the net resultant of the interaction of the secular 
trend and the various forces which tend to pull the variable 
above or below its trend value. These latter forces, con- 
stituting the disturbing factors in a situation which would 
otherwise be characterized by gradual and normal change, 
may be of several types. 

Seasonal variations are found in most series of economic 
statistics for which monthly values are obtainable. Con-
sumption and production of commodities, interest rates,
bank clearings, railroad freight traffic and many other 
types of data are marked by seasonal swings repeated with 
minor variations year after year. These, in so far as they 
exist at all, are definitely periodic in character, with a con- 
stant twelve-month period. Less markedly periodic, but 
nevertheless characterized by a considerable degree of 
regularity, are the cyclical fluctuations which are found in 
series affected by forces connected with economic or business 
cycles. Prices, wages, the volume of industrial production, 
trading on the Stock Exchange, and most series relating
to the activities of individual business units are affected
by the swings of business through alternating periods of de-
pression and prosperity. While the length of such periods 
may vary, the general sequence of change has been in the 
past sufficiently regular to render such movements capable 
of study. 


RANDOM FLUCTUATIONS


Entangled with these more or less regular movements 
are the effects of random, accidental and irregular fluctua-
tions: catastrophic events such as the San Francisco earth-
quake, wars, floods, fires and countless minor events
equally fortuitous though less violent in the resulting 
disruptions. These, like the seasonal or cyclical fluctua- 
tions, may serve to pull the actual value of a variable at a 
given date away from that normal value which would be 
expected were the regular and uniform change due to secular
movement alone taking place. The net resultant of all 
these disturbing values determines how far above or below 
normal the actual value on a given date may be. 

The analysis of series in time involves the isolation of the
effects of these various forces, in so far as this is possible.
A problem may call for the study of but one factor, or it
may require the complete breaking up of given values. 
When annual data are used the seasonal element will not
enter, of course. The explanation of methods begins with
a consideration of problems involving only this type of data. 


THE MEASUREMENT OF SECULAR TREND


As an example of the type of material in connection with 
which these problems arise, the following figures may be 
taken. The values are given in thousands of millions in 
order to simplify the calculations. 


TABLE 59
Bank Clearings in New York City, 1860-1923


(in thousands of millions)


Year  Clearings | Year  Clearings | Year  Clearings | Year  Clearings
1860      …     | 1876   $ 21.6   | 1892      …     | 1908   $ 79.3
1861      …     | 1877      …     | 1893      …     | 1909    103.6
1862      …     | 1878      …     | 1894     24.4   | 1910     97.3
1863      …     | 1879      …     | 1895      …     | 1911     92.4
1864      …     | 1880      …     | 1896      …     | 1912    100.7
1865     26.0   | 1881     49.4   | 1897      …     | 1913     94.6
1866      …     | 1882      …     | 1898      …     | 1914     83.0
1867      …     | 1883      …     | 1899      …     | 1915    110.6
1868      …     | 1884      …     | 1900     52.7   | 1916    159.6
1869      …     | 1885      …     | 1901     79.4   | 1917    177.4
1870      …     | 1886     33.7   | 1902     76.3   | 1918    178.6
1871      …     | 1887      …     | 1903     66.0   | 1919    235.8
1872      …     | 1888      …     | 1904     68.6   | 1920    243.2
1873      …     | 1889      …     | 1905     93.8   | 1921    194.4
1874      …     | 1890      …     | 1906    104.7   | 1922    217.9
1875      …     | 1891      …     | 1907     87.2   | 1923    214.0

As has been pointed out, the figure for any year, as the 
value of $100.7 thousands of millions for 1912, is the net 
resultant of many forces which may be classified as the 
secular trend, cyclical variations and random or accidental 
fluctuations. Our first problem is to determine the effect 
of the first of these forces. 

In Fig. 54 the data of New York bank clearings during
the period 1860-1923, inclusive, have been plotted. A
definite trend is apparent, together with well marked and
more or less regular deviations from that trend. Several
methods are available for arriving at approximations to
this trend. By employing moving averages an attempt may
be made to eliminate the effect of passing fluctuations and
to arrive at values which represent the effect of the steadily
operating growth factor. Assuming a definite functional
relationship between the time factor and the other variable
to prevail, empirically at least, an approximation to the
trend may be secured by fitting an appropriate curve
to the plotted data. Smoothing the data by hand gives
somewhat the same result, the curve being frankly approx-
imative and empirical in character. In certain studies it
has been found possible to use one statistical series as base
or trend line for another series of homogeneous data.


FIG. 54. New York Bank Clearings, 1860-1923, with Moving Averages
(Legend: original data; five-year moving average; nine-year moving average.)


¹ It is generally advisable to deflate such series as this before attempting
further analysis. The procedure is explained below.

MOVING AVERAGES


When a trend is to be determined by the method of 
moving averages, the average value for a number of years 
(or months, or weeks) is secured, and this average is taken
as the normal or trend value for the middle of the period
covered in the calculation of the average. The following 
table shows the results secured when three-, five-, seven-, 
and nine-year moving averages are thus computed for New 
York Bank clearings for the period 1900-1923: 


TABLE 60


New York City Bank Clearings, 1900-1923, and 3-, 5-, 7-, and
9-year Moving Averages


(in thousands of millions)


Year   Original   Three-year   Five-year    Seven-year   Nine-year
       data       moving av.   moving av.   moving av.   moving av.

1900   $ 52.7
1901     79.4      $ 69.5
1902     76.3        73.9       $ 68.6
1903     66.0        70.3         76.8       $ 77.4
1904     68.6        76.1         81.9         82.3       $ 78.7
1905     93.8        89.0         84.1         82.3         84.3
1906    104.7        95.2         86.7         86.2         86.3
1907     87.2        90.4         93.7         90.6         88.1
1908     79.3        90.0         94.4         94.0         92.0
1909    103.6        93.4         92.0         95.0         94.8
1910     97.3        97.8         94.7         93.6         93.6
1911     92.4        96.8         97.7         93.0         94.3
1912    100.7        95.9         93.6         97.5        102.3
1913     94.6        92.8         96.3        105.5        113.2
1914     83.0        96.1        109.7        116.9        121.6
1915    110.6       117.7        125.0        129.2        137.0
1916    159.6       149.2        141.8        148.5        153.7
1917    177.4       171.9        172.4        169.7        164.1
1918    178.6       197.3        198.9        185.7        177.8
1919    235.8       219.2        205.9        201.0        192.4
1920    243.2       224.5        214.0        208.8
1921    194.4       218.5        221.1
1922    217.9       208.8
1923    214.0


The three-year moving average for 1904 is the average of 
the figures for 1903-4-5; the five-year figure is the average 
of the years 1902-3-4-5-6. The other averages are computed 
in the same way. In each case the average is centered 
for the period included; that is, the average is taken to 




represent normal as of the middle of the given period. The 
employment of an odd number of years simplifies this 
centering process, though it is not essential that the number 
be odd. With an even number of years, the figure may be 
centered by taking a two-year moving average of the 
moving average. The five- and nine-year moving averages 
for the entire period are plotted, with the original data, 
in Fig. 54. 
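The centering rule just described may be sketched in a few lines of Python. This is an illustrative sketch only, not part of the original text: the function name `centered_moving_average` is hypothetical, the figures are the original data of Table 60, and the 1910 figure is taken as 97.3, the value implied by the adjacent moving averages in that table. Only odd periods are handled here.

```python
def centered_moving_average(values, period):
    """Return a list aligned with `values`; entries lacking a full window are None."""
    if period % 2 == 0:
        raise ValueError("this sketch handles odd periods only")
    half = period // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half : i + half + 1]
        out[i] = sum(window) / period   # the average is credited to the middle year
    return out

# Original data of Table 60, 1900-1923 (thousands of millions of dollars).
clearings = [52.7, 79.4, 76.3, 66.0, 68.6, 93.8, 104.7, 87.2, 79.3, 103.6,
             97.3, 92.4, 100.7, 94.6, 83.0, 110.6, 159.6, 177.4, 178.6,
             235.8, 243.2, 194.4, 217.9, 214.0]
years = list(range(1900, 1924))

ma3 = dict(zip(years, centered_moving_average(clearings, 3)))
ma5 = dict(zip(years, centered_moving_average(clearings, 5)))

print(round(ma3[1904], 1), round(ma5[1904], 1))   # 76.1 81.9
```

The end years carry no value (`None`), since no complete window can be centered on them.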

It is obvious that the effect of the averaging is to give a 
smoother curve, lessening the influence of the fluctuations 
which pull the annual figures away from the general trend. 
The longer the period included in securing each average, 
the smoother is the curve secured, though there are other 
factors to consider in deciding upon the length of the period. 
Certain of these factors may be noted. 


CHARACTERISTICS OF MOVING AVERAGES 


Given cyclical fluctuations about a uniform level or about 
a line ascending with a uniform slope, the length of the 
cycle and the magnitude of the fluctuations being constant, 
a moving average having a period equal to the period of 
the cycle (or to a multiple of that period) will give a straight 
line, a perfect representation of the trend. Under the 
same conditions a moving average having a period greater 
or less than the period of the cycle will give, not a straight 
line, but a new cycle having the same period as the original, 
but with fluctuations of less magnitude. The minima and 
maxima of the cycles thus obtained, moreover, will not 
necessarily coincide with the minima and maxima of the 
original cycles. In general, when such a new cycle is obtained 
the magnitude of the fluctuations will be less the longer 
the period on which the average is based.¹ 

These propositions may be illustrated by the following 


1 The decrease in the magnitude of the fluctuations is not regular, however, 
but cyclical. 




figures, arbitrarily chosen. In the first example five figures 
have been selected which repeat themselves in sequence, 
fluctuating about a common level. 


TABLE 61 
Illustrating the Application of Moving Averages 
  (1)          (2)              (3)              (4)              (5) 
Cyclical   Moving average   Moving average   Moving average   Moving average 
  data       of 5 items       of 10 items      of 3 items       of 8 items 
                              (centered)                        (centered) 

    2 
    6                                             5 1/3 
    8         6 1/5                               8 
   10         6 1/5                               7 2/3 
    5         6 1/5                               5 2/3          6 3/8 
    2         6 1/5            6 1/5              4 1/3          6 13/16 
    6         6 1/5            6 1/5              5 1/3          6 3/8 
    8         6 1/5            6 1/5              8              5 3/4 
   10         6 1/5            6 1/5              7 2/3          5 11/16 
    5         6 1/5            6 1/5              5 2/3          6 3/8 
    2         6 1/5            6 1/5              4 1/3          6 13/16 
    6         6 1/5            6 1/5              5 1/3          6 3/8 
    8         6 1/5            6 1/5              8              5 3/4 
   10         6 1/5            6 1/5              7 2/3          5 11/16 
    5         6 1/5            6 1/5              5 2/3          6 3/8 
    2         6 1/5                               4 1/3          6 13/16 
    6         6 1/5                               5 1/3 
    8         6 1/5                               8 
   10                                             7 2/3 
    5 


(The items in columns (3) and (5) have been centered by means of a 
moving average of 2 items.) 


The moving averages in columns (2) and (3) represent 
the data with the cycles completely removed. When the 
period of the average is not equal to the period of the cycle, 
or to a multiple of that period, the cycle is not removed, as 
is apparent from the figures in columns (4) and (5). 
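The four averages of Table 61 may be recomputed exactly with the following sketch. It is offered only as an illustration under stated assumptions: the helper names `moving_average` and `centered_even` are hypothetical, and exact fractions are used so that the results can be compared with the common fractions of the table.

```python
from fractions import Fraction

def moving_average(values, period):
    """Averages of successive windows of `period` items, as exact fractions."""
    return [Fraction(sum(values[i:i + period]), period)
            for i in range(len(values) - period + 1)]

def centered_even(values, period):
    """Center an even-period average by a 2-item moving average of it."""
    return moving_average(moving_average(values, period), 2)

cycle = [2, 6, 8, 10, 5] * 4          # column (1) of Table 61

col2 = moving_average(cycle, 5)       # period equal to the cycle
col3 = centered_even(cycle, 10)       # period twice the cycle
col4 = moving_average(cycle, 3)       # period shorter than the cycle
col5 = centered_even(cycle, 8)        # period not a multiple of the cycle

print(set(col2), set(col3))           # one value only in each: 31/5, i.e. 6 1/5
print(col4[:5])                       # 16/3, 8, 23/3, 17/3, 13/3
print(col5[:5])                       # 51/8, 109/16, 51/8, 23/4, 91/16
```

Columns (2) and (3) reduce to the constant 6 1/5, while columns (4) and (5) repeat with the five-item period of the data.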

The conclusions suggested above hold when the cyclical 
fluctuations take place about any straight line. In the 
example which follows the foregoing data have been employed 
but with a constant increment of 3. This is equivalent 
to superimposing the same cycles upon a line with a 
slope of + 3. 


TABLE 62 


Illustrating the Application of Moving Averages to a Series with 
Linear Trend 


  (1)          (2)              (3)              (4)              (5) 
Cyclical   Moving average   Moving average   Moving average   Moving average 
  data       of 5 items       of 10 items      of 3 items       of 8 items 
                              (centered)                        (centered) 

    2 
    9                                             8 1/3 
   14        12 1/5                              14 
   19        15 1/5                              16 2/3 
   17        18 1/5                              17 2/3         18 3/8 
   17        21 1/5           21 1/5             19 1/3         21 13/16 
   24        24 1/5           24 1/5             23 1/3         24 3/8 
   29        27 1/5           27 1/5             29             26 3/4 
   34        30 1/5           30 1/5             31 2/3         29 11/16 
   32        33 1/5           33 1/5             32 2/3         33 3/8 
   32        36 1/5           36 1/5             34 1/3         36 13/16 
   39        39 1/5           39 1/5             38 1/3         39 3/8 
   44        42 1/5           42 1/5             44             41 3/4 
   49        45 1/5           45 1/5             46 2/3         44 11/16 
   47        48 1/5           48 1/5             47 2/3         48 3/8 
   47        51 1/5                              49 1/3         51 13/16 
   54        54 1/5                              53 1/3 
   59        57 1/5                              59 
   64                                            61 2/3 
   62 


(The items in columns (3) and (5) have been centered by means of a 
moving average of 2 items.) 


The trend values, with the effect of the cycles completely 
removed, are secured by taking moving averages equal in 
period to the cycle or to a multiple of that period. The 
cycle persists, with the same period but with diminished 
amplitude, when the average is based upon a period not 
equal to that of the cycle, as is clear from the figures in 
columns (4) and (5). 
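The same point may be checked numerically for the series with linear trend. The sketch below is an illustration only; the helper name `moving_average` is hypothetical. It rebuilds column (1) of Table 62 and examines the step from each average to the next: a constant step of 3 means that only the straight line remains.

```python
from fractions import Fraction

def moving_average(values, period):
    """Averages of successive windows of `period` items, as exact fractions."""
    return [Fraction(sum(values[i:i + period]), period)
            for i in range(len(values) - period + 1)]

cycle = [2, 6, 8, 10, 5] * 4
series = [c + 3 * i for i, c in enumerate(cycle)]   # column (1) of Table 62

ma5 = moving_average(series, 5)    # period equal to the cycle
ma3 = moving_average(series, 3)    # period shorter than the cycle

steps5 = {b - a for a, b in zip(ma5, ma5[1:])}
steps3 = [b - a for a, b in zip(ma3, ma3[1:])]

print(ma5[0], steps5)   # 61/5 and a single step of 3
print(steps3[:5])       # unequal steps: the cycle persists
```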

When these ideally simple conditions of constant period 
and amplitude do not exist, the moving average becomes 




more ambiguous and its interpretation becomes less simple. 
If the period of the cycle varies, the selection of a period for 
the moving average is more difficult. In general, a period 
equal to or greater than the average length of the cycle 
is to be selected. An average having a shorter period will 
give a line which is marked by pronounced cycles, these 
cycles being reduced as the period covered in the calcula- 
tion of the average increases. 

When the amplitude of the cycle varies, the period being 
constant, a moving average with a period equal to the length 
of the cycle will give a line of trend marked by minor cycles. 
The amplitude of these secondary cycles will be a minimum 
when the period of the average is equal to the period of the 
cycle (or to a multiple of that period). When these last 
two irregularities are combined, and the data are character- 
ized by cycles of varying amplitude and of varying length, 
the moving average giving the most effective representation 
of the trend is that which has a period equal to the average 
length of the cycle, or to a multiple of that length. 

A new factor enters when the trend departs from linearity. 
If the underlying trend of a series is concave upward, a 
moving average will always exceed the actual trend value; 
if the reverse is true, and the trend is convex upward, a 
moving average will always be less than the actual trend 
value. 

These conditions are depicted in the following examples. 
In Table 63 figures are presented which give the values 
secured when a cycle of constant period and amplitude is 
superimposed upon a line of trend which is concave up- 
ward, i.e., increasing at a constantly increasing rate. If 
the moving average could completely eliminate the effects 
of the cycle, the values secured from the average would be 
equal to the average value of the five items (6 1/5) plus the 
values of the function $y = x^2$. 




TABLE 63 
Illustrating the Application of Moving Averages to a Non-Linear Series 


(Increasing rate) 


 (1)    (2)       (3)          (4)              (5)              (6) 
  x    $x^2$   Cyclical   Col. (2) plus   Moving average    True trend 
                 data       col. (3)        of 5 items        values 
                                                           ($x^2 + 6.2$) 

  0       0        2            2 
  1       1        6            7 
  2       4        8           12             12.2             10.2 
  3       9       10           19             17.2             15.2 
  4      16        5           21             24.2             22.2 
  5      25        2           27             33.2             31.2 
  6      36        6           42             44.2             42.2 
  7      49        8           57             57.2             55.2 
  8      64       10           74             72.2             70.2 
  9      81        5           86             89.2             87.2 
 10     100        2          102            108.2            106.2 
 11     121        6          127            129.2            127.2 
 12     144        8          152            152.2            150.2 
 13     169       10          179            177.2            175.2 
 14     196        5          201            204.2            202.2 
 15     225        2          227            233.2            231.2 
 16     256        6          262            264.2            262.2 
 17     289        8          297            297.2            295.2 
 18     324       10          334 
 19     361        5          366 


The values of the moving average are, in this case, always 
above the true trend values, a form of distortion that will 
always occur with a series of this type. 
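The constant excess of 2 shown in Table 63 may be verified directly. The short derivation below is an illustrative aside rather than part of the original argument; the symbol m (half the span of an odd-period average, so that the period is 2m + 1) is introduced here only for convenience. For the five-item average of the trend values alone, the cross terms cancel and

$$\frac{1}{5}\sum_{k=-2}^{2}(x+k)^2 = x^2 + \frac{1}{5}\sum_{k=-2}^{2}k^2 = x^2 + 2,$$

so that, the cycle being averaged out to 6.2, the moving average is $x^2 + 8.2$ against a true trend of $x^2 + 6.2$. For a period of 2m + 1 items the same reasoning gives, for the trend component,

$$\frac{1}{2m+1}\sum_{k=-m}^{m}(x+k)^2 = x^2 + \frac{m(m+1)}{3},$$

an excess of 2/3 for three items, 2 for five, 4 for seven, and 6 2/3 for nine. The distortion thus grows with the period of the average.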

In Table 64 are shown the results of superimposing the 
same cyclical values upon a line of trend which is convex 
upward, i.e., increasing at a constantly decreasing rate. 
In this case, a perfect method of eliminating the cycles 
would give results equal to the average value of the five 
items (6 1/5) plus the values of the function $y = \sqrt{x}$. 

In this case the moving average values are consistently 
too low. The discrepancy is most marked for the lower 
values of x, as the decrease in the rate of growth is most 
marked for these values. 
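The behavior described may be checked with a brief sketch. It is an illustration only, and it uses unrounded square roots, so its figures may differ in the third decimal place from those of Table 64, which rest on roots rounded to two places. The names `ma5`, `true_trend` and `shortfall` are hypothetical.

```python
import math

cycle = [2, 6, 8, 10, 5] * 4
series = [math.sqrt(x) + c for x, c in zip(range(20), cycle)]   # column (4)

# Five-item moving average, centered on x = 2, ..., 17.
ma5 = {x: sum(series[x - 2 : x + 3]) / 5 for x in range(2, 18)}
true_trend = {x: math.sqrt(x) + 6.2 for x in ma5}
shortfall = {x: true_trend[x] - ma5[x] for x in ma5}

# The average lies below the true trend throughout, and the gap is
# widest at the low values of x, where the curve bends most sharply.
print(all(gap > 0 for gap in shortfall.values()))
print(shortfall[2] > shortfall[17])
```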




TABLE 64 
Illustrating the Application of Moving Averages to a Non-Linear Series 


(Decreasing rate) 


 (1)      (2)         (3)          (4)              (5)               (6) 
  x    $\sqrt{x}$   Cyclical   Col. (2) plus   Moving average     True trend 
                      data       col. (3)        of 5 items         values 
                                                               ($\sqrt{x} + 6.2$) 

  0      0             2           2.00 
  1      1.00          6           7.00 
  2      1.41          8           9.41            7.428             7.61 
  3      1.73         10          11.73            7.876             7.93 
  4      2.00          5           7.00            8.166             8.20 
  5      2.24          2           4.24            8.414             8.44 
  6      2.45          6           8.45            8.634             8.65 
  7      2.65          8          10.65            8.834             8.85 
  8      2.83         10          12.83            9.018             9.03 
  9      3.00          5           8.00            9.192             9.20 
 10      3.16          2           5.16            9.354             9.36 
 11      3.32          6           9.32            9.510             9.52 
 12      3.46          8          11.46            9.658             9.66 
 13      3.61         10          13.61            9.800             9.81 
 14      3.74          5           8.74            9.936             9.94 
 15      3.87          2           5.87           10.068            10.07 
 16      4.00          6          10.00           10.194            10.20 
 17      4.12          8          12.12           10.318            10.32 
 18      4.24         10          14.24 
 19      4.36          5           9.36 


Considerations previously reviewed have indicated that a 
moving average should, in general, be based upon a period 
at least equal to the period of the cycle, and preferably 
equal to some higher multiple of that period when the data 
are at all irregular. The longer the period covered, the 
greater the stability of the average. But when the underly- 
ing trend departs materially from the linear form, following 
a curve bending upward or downward, the error involved 
in the use of any moving average increases as the period of 
the average increases. If a moving average is used in such 
a case to measure the trend, the period of the average should 
be the shortest which will serve to average out the cycles; 
equal, that is, to the average length of one cycle. 




In practice, however, these various conditions are found 
in complicated combinations. The fact that cycles vary in 
amplitude and length calls for a moving average based upon 
a fairly long period. The fact that the trend of the data is 
usually non-linear calls for a short period average to lessen 
the upward or downward distortion. A consideration of 
some importance in practical work is that a moving aver- 
age can never be brought up to date. The lag is less, of 
course, the shorter the period covered by the average. The 
selection of a period in a given case must rest upon a study 
of the actual data with these various considerations in 
mind. 

It has been assumed in the preceding discussion that the 
purpose of the moving average is the representation of 
secular trend. The moving average may be used, also, in 
smoothing data for the purpose of eliminating random 
fluctuations. For this purpose a moving average based 
upon a period less than the average period of the cycle 
should be selected. 

We may return now to the problem relating to New York 
Bank Clearings. A study of the lines marked out by the 
different moving averages in Fig. 54 reveals significant 
differences between them. The five-year average follows 
the graph of the original data most closely, as would be 
expected. The nine-year average marks out the smoothest 
line of trend, but, on the other hand, departs most widely 
from the data. This is particularly noticeable during the 
period from 1893 to 1898 and from 1911 to 1915, a condition 
due to the pronounced changes in the rate of growth of the 
series during these two periods. Except for these distortions 
the general trend seems to be most accurately represented 
by the nine-year average. 

In determining the relative merits of the different moving 
averages we are aided by a knowledge of the course of 
business during the period covered. The volume of New 
York bank clearings is a sensitive index of general business 




conditions, responding immediately to changes in specula- 
tive and industrial activity. Major and minor business 
cycles are reflected in this series. Knowing the number of 
cycles through which business has passed during the last 
half century (1870-1920), we may determine which of the 
moving averages serves best as a standard from which to 
measure cyclical deviations. In this case we are practically 
working backward from a known result, a method not 
always available. 

If we take as a starting point in each cycle the year before 
business attains a normal condition after depression, the 
following cycles in general business activity may be 
distinguished:¹ 


1870-1878 1904-1908 
1879-1885 1908-1911 
1886-1897 1911-1914 
1898-1904 1915- 


The cycles marked out by the three-year moving average 
are too numerous to enumerate. In fact, the deviations 
from this average are primarily accidental and minor 
fluctuations and should not be classed as cycles. Deviations 
from the seven- and nine-year averages mark out the following 
cycles: 


Cycles of deviations from      Cycles of deviations from 
seven-year averages            nine-year averages 

1871-1878                      1871-1878 
1879-1884                      1879-1887 
1885-1888                      1888-1897 
1888-1897                      1898-1900 
1898-1900                      1900-1903 
1900-1903                      1904-1907 
1904-1907                      1908-1914 
1908-1911                      1915- 
1911-1914 
1915-1918 

1 These cycles are based upon an index computed by W. F. Ogburn and Dorothy 
Thomas; cf. Quarterly Publications of the American Statistical Association, Sept., 
1922, 324-340. 




Certain of the differences between the cycles thus de- 
termined are undoubtedly due to differences in the materials 
analyzed. Some of the other differences are worthy of note. 
Deviations from the seven-year average show three more 
cycles than either of the other series. The cycles from 
1885 to 1888 and from 1915 to 1918 are duplicated in 
neither of the other series, appearing as minor fluctuations. 
The cycle from 1898 to 1900 appears also in the deviations 
from the nine-year average, but not in the index used as 
the basis of comparison. The cycle from 1911 to 1914 
is duplicated in this index, but not in the deviations from 
the nine-year average. The great increase during the war 
years distorted the nine-year average in such a way as to 
smooth out a real cycle in the data. 

In summary, it may be said that, for the present data, a 
moving average with a period of less than seven years will 
not serve as a measure of trend. The nine-year average 
serves effectively, except for the latter part of the period 
when the rate of increase is changing sharply. The seven- 
year period appears to be somewhat short, emphasizing 
some fluctuations which are not of major importance. 
In general, the moving average has the prime advantage 
of flexibility. The representation of secular trend by 
mathematical curves frequently involves the breaking up 
of a period into two or three subdivisions, and the fitting 
of separate curves to each. This necessarily results from 
changing conditions and sharply changing rates of growth 
or decline. Where such changes occur the moving average 
has the merit of flexible adaptation to the new conditions 
and is often a more effective measure of secular trend than 
curves fitted with great labor. 

The simple moving average is sometimes modified by 
giving varying weights to the constituent items, weighting 
more heavily the items near the centers of the groups suc- 
cessively set up in computing the averages. Such a weighted 
moving average is called a progressive mean. The coefficients 
of terms in the binomial expansion are used as weights. 
In the preceding examples, which called for a moving 
average of five items, the weights employed would be 1, 4, 
6, 4, 1. While the progressive mean has its uses in some 
types of analysis, it does not seem appropriate as a means 
of eliminating periodicities in time series, or of smoothing 
out random fluctuations in such series. 
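A progressive mean of the kind described may be sketched as follows. This is an illustration under stated assumptions: the function names `binomial_weights` and `progressive_mean` are hypothetical, and the five repeating figures of the earlier tables serve as data.

```python
from fractions import Fraction
from math import comb

def binomial_weights(period):
    """Coefficients of the binomial expansion of power (period - 1)."""
    return [comb(period - 1, k) for k in range(period)]

def progressive_mean(values, period):
    """Weighted moving average with binomial weights, as exact fractions."""
    weights = binomial_weights(period)
    total = sum(weights)
    return [Fraction(sum(w * v for w, v in zip(weights, values[i:i + period])), total)
            for i in range(len(values) - period + 1)]

cycle = [2, 6, 8, 10, 5] * 3
pm = progressive_mean(cycle, 5)

print(binomial_weights(5))   # [1, 4, 6, 4, 1]
print(pm[:5])                # 119/16, 15/2, 23/4, 37/8, 91/16
```

With these data the weighted average, though its period equals that of the cycle, does not reduce to a constant, which agrees with the remark that it is not suited to eliminating periodicities.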


REPRESENTATION OF SECULAR TREND BY MATHEMATICAL CURVES 


For many types of data the secular trend may be repre- 
sented by a mathematical curve rather than by a line based 
upon a moving average. Thus, if the growth (or decline) 
is by constant absolute increments (or decrements) a 
straight line will serve as an exact representation of the 
trend. Or the growth may be by constant percentages, as 
in the case of capital increase, when a principal sum in- 
creases in accordance with the compound interest law. A 
curve of a definite mathematical form furnishes the best 
representation of this trend. In many series of economic 
statistics the data seem to conform to definite laws of 
growth, or decline, and where this is the case the task of 
analysis, interpretation and projection is materially assisted 
by securing a mathematical expression for the underlying 
law. In practically all cases, of course, there are departures 
from this law, deviations above and below the line of secular 
trend. These deviations, however, do not destroy the value 
of an equation which describes the underlying law of develop- 
ment. 

There is one fundamental difference between the moving 
average as a measure of trend and such mathematical curves. 
The former implies no definite “law” to which the data are 
assumed to conform. It is based upon the data as given; if 
the general trend changes, the moving average follows the 
new trend. It is a flexible measure of trend, adapting itself 




to changing conditions, purporting to be nothing more 
than an empirical approximation to the drift of the series. 
Mathematical curves fitted to economic series are, in fact, 
nothing more than empirical approximations also, but in a 
somewhat different sense. They assume a “law” of change 
underlying the variations, accidental and otherwise, which 
show upon the surface of the data. It is an empirical law 
which is assumed, it is true, but nevertheless there is postu- 
lated a uniform and consistent trend capable of mathe- 
matical expression. If such an assumption is to have any 
validity it is essential that the period during which the law 
is supposed to hold be homogeneous, that there be no 
material changes in the conditions affecting the series being 
studied. Thus an equation, let us say, is secured for the 
trend of gold production. If a radical change should take 
place in methods of extraction the trend of gold production 
would change materially and the former equation would no 
longer apply. Data covering the period before and after 
such a change would not be homogeneous, and a single 
equation for the trend during the whole period should not 
be secured. 

In the practical approach to a problem involving the 
determination of secular trend the first task is the selection 
of the appropriate type of curve. This is perhaps the most 
difficult part of the work; certainly it is the part in which 
the element of personal judgment enters most directly. 
For there is no objective rule to follow, no fixed standard 
by which the most appropriate curve may be selected. 
Something more will be said on this subject after the 
characteristics of the chief types of curves and the methods 
of fitting them have been described. For the present it 
may be assumed that a curve similar to one of the types 
described in Chapter II, or of a related form, has been 
selected, and that we face the practical task of fitting it to 
the data. 




FITTING A STRAIGHT LINE; THE METHOD 
OF LEAST SQUARES 

If the data, when plotted, show a trend which can best 
be represented by a straight line the task of fitting is merely 
the determination of the constants in an equation of the 
form y = a + bx. The values of a and b which will give 
a line following most closely the trend of the data are to be 
obtained. A simple illustration may serve to demonstrate 
the various methods which may be employed. Nine points 
(1, 3; 2, 4; 3, 6; 4, 5; 5, 10; 6, 9; 7, 10; 8, 12; 9, 11) are 
plotted in Fig. 55. Our problem is the fitting of a straight 
line to these points. 


Fig. 55. Illustrating the Fitting of a Straight Line to Nine Points 


By inspection approximate values of a and b may be 
determined. A thread may be stretched through the 
points in such a direction that it seems to follow the trend 
as closely as possible. The slope of the line thus laid out 




may be measured, the y-intercept determined, and the 
desired equation thus approximated. Obviously this is a 
loose and uncertain method, and the results obtained by 
different individuals may be expected to vary rather 
widely. There is one and only one straight line which fits 
most accurately the plotted data. The constants for this 
line of best fit may be determined by the method of least 
squares. 

The theory upon which the method of least squares is 
based need not be detailed at length here. The argument 
may be briefly presented: A number of observation values 
of a certain quantity are found, and it is desired to obtain 
the most probable value of the quantity which is being 
measured. It is capable of demonstration that the most 
probable value of the quantity is that value for which the 
sum of the squares of the residuals is a minimum. (The 
“residual” is a term for the difference between a given value 
and each of the observation values.) This is true of the 
arithmetic mean of the observation values. Thus, if a given 
distance be measured by a number of individuals, with 
varying results, the most probable value is the arithmetic 
mean of the different measurements. The process of 
computing the mean involves the following steps, which are 
enumerated for the purpose of simplifying the later explana- 
tion. We seek a result, a statement of the most probable 
value of the distance being measured, which will take the 


form: M = (a constant). 


Let us say we have three approximations to this value: 


M = 5672 feet 
M = 5671 feet 
M = 5676 feet 


adding, 3M = 17019 feet 


Since there is but one unknown, M, it may be derived 
directly from this equation, and we have 


M = 5673 feet. 




This is the value for which the sum of the squares of the 
deviations is a minimum. 
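The minimum property of the mean may be verified for the three measurements with a short sketch. It is given only as an illustration; the function name `sum_of_squares` is hypothetical.

```python
measurements = [5672, 5671, 5676]        # the three approximations, in feet

def sum_of_squares(m):
    """Sum of the squared residuals about a trial value m."""
    return sum((m - obs) ** 2 for obs in measurements)

mean = sum(measurements) / len(measurements)

print(mean, sum_of_squares(mean))                   # 5673.0 14.0
print(sum_of_squares(5672), sum_of_squares(5674))   # 17 17
```

A trial value one foot on either side of the mean raises the sum of the squares from 14 to 17.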

A similar problem arises when the relation between two 
variables is being measured. Our goal in this case is the 
equation which correctly describes this relationship. We 
have secured, however, varying results which do not agree 
precisely as to the constants in the equation of relationship. 
In other words, our plotted points do not all lie on the same 
line. What are the most probable values of the constants 
in the required equation? The answer is analogous to that 
given when a single quantity was being measured. We seek 
the constants which, when the resulting equation is plotted, 
will give a line from which the deviations of the separate 
points, when squared and totaled, will be a minimum. 
Assuming that each pair of measurements gives an approxi- 
mation to the true relationship between the variables, we 
wish to find the most probable relationship, and this is given 
by the line for which the sum of the squared deviations is a 
minimum.¹ 

We have, in the present example, nine pairs of values for 
x and y. Substituting these values in the generalized form 
of the linear equation, y = a + bx, we secure the following 
observation equations: 


 3 = a + 1b 
 4 = a + 2b 
 6 = a + 3b 
 5 = a + 4b 
10 = a + 5b 
 9 = a + 6b 
10 = a + 7b 
12 = a + 8b 
11 = a + 9b 


Any two of these equations could be solved as simultaneous 
equations, and values of a and b secured. But these values 


1 Cf. Appendix A for a more detailed discussion of the method of least squares, 
together with a description of certain checks upon the calculations. 




would not satisfy the remaining equations. Our problem 
is to combine the nine observation equations so as to secure 
two normal equations, which, when solved simultaneously, 
will give the most probable values of a and b. The first of 
these normal equations is secured by multiplying each of 
the observation equations by the coefficient of the first 
unknown (a) in that equation, and adding the equations 
obtained in this way. Since the coefficient of a in the 
present case is 1 throughout, the nine observation equations 
are unchanged. The second of the normal equations is 
secured by multiplying each of the observation equations 
by the coefficient of the second unknown (b) in that equa- 
tion, and adding the equations obtained. Thus the first 
equation is multiplied throughout by 1, the second by 2, 
and so on. The process of securing the two normal equations 
is illustrated below. 


TABLE 65 
Derivation of Normal Equations from Observation Equations 

 3 =  a +  1b           3 =  1a +   1b 
 4 =  a +  2b           8 =  2a +   4b 
 6 =  a +  3b          18 =  3a +   9b 
 5 =  a +  4b          20 =  4a +  16b 
10 =  a +  5b          50 =  5a +  25b 
 9 =  a +  6b          54 =  6a +  36b 
10 =  a +  7b          70 =  7a +  49b 
12 =  a +  8b          96 =  8a +  64b 
11 =  a +  9b          99 =  9a +  81b 
70 = 9a + 45b         418 = 45a + 285b 


The two normal equations are 


 70 =  9a +  45b 
418 = 45a + 285b. 


It remains to solve these equations for a and b. By multiplying 
the first equation by 5 and subtracting it from the 
second, a may be eliminated; a value of 68/60, or 1.133, is 
found for b. Substituting this value in either of the equations, 
a value of 2.111 is secured for a. The equation to 
the best fitting straight line is, therefore, 


y = 2.111 + 1.133x 
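The formation and solution of the two normal equations may be retraced in a short sketch. It is an illustration only, using the nine points of the example; the variable names (`first`, `second`, `factor`) are hypothetical, and exact fractions are used so that the values 68/60 and 19/9 appear without rounding.

```python
from fractions import Fraction

# The nine points of the example, as (x, y) pairs.
points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10),
          (6, 9), (7, 10), (8, 12), (9, 11)]

# Each normal equation is stored as [left side, coefficient of a, coefficient of b].
# First: observation equations multiplied by the coefficient of a (1) and added.
first = [sum(y for _, y in points), len(points), sum(x for x, _ in points)]
# Second: observation equations multiplied by the coefficient of b (x) and added.
second = [sum(x * y for x, y in points),
          sum(x for x, _ in points),
          sum(x * x for x, _ in points)]
print(first)     # [70, 9, 45]     i.e. 70 = 9a + 45b
print(second)    # [418, 45, 285]  i.e. 418 = 45a + 285b

# Eliminate a: multiply the first equation by 45/9 = 5 and subtract it.
factor = Fraction(second[1], first[1])
b = (second[0] - factor * first[0]) / (second[2] - factor * first[2])
a = (first[0] - first[2] * b) / first[1]

print(b, a)                                      # 17/15 19/9
print(round(float(a), 3), round(float(b), 3))    # 2.111 1.133
```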


It is not necessary to write out and total the equations, 
as is done above. In the actual application of the method 
it is necessary only to insert the proper values in the two 


equations ¹ 

$$\Sigma(y) = na + b\,\Sigma(x)$$ 

$$\Sigma(xy) = a\,\Sigma(x) + b\,\Sigma(x^2).$$ 


The symbols employed have the following meanings: 


$\Sigma(y)$: the sum of the values of y. 
$\Sigma(x)$: the sum of the values of x. 
$\Sigma(xy)$: the sum of the products of the paired x's and y's. 
$\Sigma(x^2)$: the sum of the squares of the values of x. 
n: the number of pairs of values; the number of points 
plotted. 


The work of computation is facilitated by a tabular 
arrangement similar to the following: 


TABLE 66 
Computation of Values Required in Fitting a Straight Line 


  x      y      xy    $x^2$ 

  1      3       3      1        n = 9 
  2      4       8      4        $\Sigma(x)$ = 45 
  3      6      18      9        $\Sigma(y)$ = 70 
  4      5      20     16        $\Sigma(x^2)$ = 285 
  5     10      50     25        $\Sigma(xy)$ = 418 
  6      9      54     36 
  7     10      70     49 
  8     12      96     64 
  9     11      99     81 
 45     70     418    285 


The two desired normal equations are secured by substituting 
these five values in the type equations given 


1 General rules for the formation of normal equations are given in Appendix A. 




above. It will be noted that the results are identical with 
those obtained from the observation equations. 

When the equation to the best fitting straight line has 
been obtained the values of y corresponding to given values 
of x may be computed and compared with the observed 
values. The table which follows presents the results 
secured: 


TABLE 67 
Comparison of Observed and Computed Values of a Variable Quantity ¹ 

  x        y            y              d          $d^2$         xd 
       (observed)   (computed) 

  1        3         3.24 4/9     - .24 4/9      .0597     -  .24 4/9 
  2        4         4.37 7/9     - .37 7/9      .1427     -  .75 5/9 
  3        6         5.51 1/9     + .48 8/9      .2390     + 1.46 2/3 
  4        5         6.64 4/9     -1.64 4/9     2.7041     - 6.57 7/9 
  5       10         7.77 7/9     +2.22 2/9     4.9381     +11.11 1/9 
  6        9         8.91 1/9     + .08 8/9      .0079     +  .53 1/3 
  7       10        10.04 4/9     - .04 4/9      .0020     -  .31 1/9 
  8       12        11.17 7/9     + .82 2/9      .6760     + 6.57 7/9 
  9       11        12.31 1/9     -1.31 1/9     1.7190     -11.80 
Total                               0.0        10.4885        0.0 


The sum of the deviations of the plotted points from the 
line is zero. The sum of the deviations when each is multi- 
plied by the corresponding value of x is also zero. The 
accuracy of the actual calculations involved in fitting may 
be tested in this way. The sum of the squares of the devia- 
tions, 10.4885, is a minimum. Any change in the value of 
a or b would give a line from which the sum of the squared 
deviations would exceed 10.4885. 
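The two checks and the minimum property may be confirmed with a brief sketch. It is an illustration only: the constants are entered as the exact fractions 19/9 and 17/15 (2.111 and 1.133 when rounded), and the function name `deviations` is hypothetical. Carried out exactly, the sum of the squared deviations is 472/45, or about 10.489; the figure 10.4885 of the text presumably reflects rounding of the separate squares.

```python
from fractions import Fraction

points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10),
          (6, 9), (7, 10), (8, 12), (9, 11)]
a, b = Fraction(19, 9), Fraction(17, 15)

def deviations(a, b):
    """Observed minus computed values of y for the line y = a + bx."""
    return [y - (a + b * x) for x, y in points]

d = deviations(a, b)
sum_d = sum(d)
sum_xd = sum(x * di for (x, _), di in zip(points, d))
sum_d2 = sum(di ** 2 for di in d)
print(sum_d, sum_xd, sum_d2)     # 0 0 472/45

# A small change in either constant increases the sum of the squares.
shifted_a = sum(di ** 2 for di in deviations(a + Fraction(1, 10), b))
shifted_b = sum(di ** 2 for di in deviations(a, b - Fraction(1, 10)))
print(shifted_a > sum_d2, shifted_b > sum_d2)   # True True
```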


FITTING A STRAIGHT LINE; SPECIAL CASES 


The simultaneous solution of the two normal equations 
will give, in any case, the most probable values of a and b. 
The processes of calculation may be simplified in certain 
special cases, not infrequently encountered in handling 


1 The common fractions are retained in certain columns in order that the sum 
of the deviations may be exactly zero. 




economic data. If the x's are consecutive numbers, as they 
always are when an unbroken time series is plotted, the 
origin may be taken at the median value. When the 
number of observations is odd, this will be the middle item, 
of course. The value of $\Sigma(x)$ will then be zero, and the 
normal equations become 


$$\Sigma(y) = na$$ 

$$\Sigma(xy) = b\,\Sigma(x^2).$$ 


Thus if a time series extends, by years, from 1900 to 1920, 
the origin may be taken at 1910, the value of x correspond- 
ing to 1909 being — 1, to 1911, +1, and so on. The 
solution for values of a and b is rendered much easier when 
the data can be disposed in this way. When there is an 
even number of years the same process is possible, time 
(the x-variable) being measured in units of one half year. 
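The saving in labor may be seen by applying the shortened equations to the nine points of the earlier example, with the origin moved to the middle item. The sketch is an illustration only; the names `middle`, `shifted`, `a_shifted` and `a_original` are hypothetical.

```python
from fractions import Fraction

points = [(1, 3), (2, 4), (3, 6), (4, 5), (5, 10),
          (6, 9), (7, 10), (8, 12), (9, 11)]

middle = 5                                    # median value of x
shifted = [(x - middle, y) for x, y in points]

n = len(shifted)
sum_x = sum(x for x, _ in shifted)            # zero, by choice of origin
sum_y = sum(y for _, y in shifted)
sum_xy = sum(x * y for x, y in shifted)
sum_x2 = sum(x * x for x, _ in shifted)

a_shifted = Fraction(sum_y, n)                # from sum(y) = n * a
b = Fraction(sum_xy, sum_x2)                  # from sum(xy) = b * sum(x^2)
a_original = a_shifted - b * middle           # intercept referred back to x = 0

print(sum_x, a_shifted, b, a_original)        # 0 70/9 17/15 19/9
```

The slope is unchanged by the shift of origin; only the constant a differs, being now the value of the line at the middle item.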

Again, when the values of x are consecutive positive 
numbers starting at 1, the values of $\Sigma(x)$ and of $\Sigma(x^2)$ 
may be easily determined. The sum of the first n natural 
numbers is equal to $\frac{n(n+1)}{2}$. Thus the sum of the numbers 
from 1 to 10 is $\frac{10(10+1)}{2}$, or 55. This term may replace 
$\Sigma(x)$ in the normal equations. Similarly, the sum of the 
squares of the first n natural numbers is equal to 
$\frac{2n^3 + 3n^2 + n}{6}$. Thus the sum of the squares of the numbers 
from 1 to 5 is equal to $\frac{250 + 75 + 5}{6} = 55$. This expression 
may replace $\Sigma(x^2)$ in the normal equations, and we have 


$$\Sigma(y) = na + b\left(\frac{n(n+1)}{2}\right)$$ 

$$\Sigma(xy) = a\left(\frac{n(n+1)}{2}\right) + b\left(\frac{2n^3 + 3n^2 + n}{6}\right).$$ 




It is sometimes easier to work from equations in this form 
than in the form first given. The data for time series may 
be handled in this way, the years being numbered con- 
secutively, beginning with 1. 
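The two summation formulas may be checked against direct addition with a few lines. This is an illustrative sketch; the function names are hypothetical.

```python
def sum_of_first(n):
    """Sum of the first n natural numbers: n(n + 1) / 2."""
    return n * (n + 1) // 2

def sum_of_squares_of_first(n):
    """Sum of the squares of the first n natural numbers: (2n^3 + 3n^2 + n) / 6."""
    return (2 * n**3 + 3 * n**2 + n) // 6

for n in (5, 9, 10):
    assert sum_of_first(n) == sum(range(1, n + 1))
    assert sum_of_squares_of_first(n) == sum(k * k for k in range(1, n + 1))

print(sum_of_first(10), sum_of_squares_of_first(5))   # 55 55
print(sum_of_first(9), sum_of_squares_of_first(9))    # 45 285
```

For nine consecutive items the formulas give 45 and 285, the sums used in the straight-line example above.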


FITTING A CURVE OF THE POTENTIAL SERIES 


The discussion above has been confined to the case of 
linear trend. Such a type of curve frequently gives an 
excellent fit, but in many cases it fails accurately to fit the 
data. This difficulty is sometimes overcome in practice by 
breaking a series up into segments, and fitting a separate 
line to the data for each of these periods. Where there is an 
actual break in the series, the period as a whole lacking 
homogeneity, this practice may be justified, but when the 
period is essentially homogeneous the whole concept of 
secular trend is violated by this process of subdividing and 
fitting separate lines. In many cases where a straight line 
will not fit, a curve of the potential series may represent 
the trend accurately. The general process of fitting such 
a curve may be briefly described. 

The generalized form of the equation of the type desired 
is $y = a + bx + cx^2 + dx^3 + \ldots$ An equation of this form 
does not, of course, represent a curve of the parabolic 
type, but in ordinary usage that term is applied to the 
potential series. If carried to the second power of x it is 
called a second degree parabola; if to the third power, a 
third degree parabola, etc. For ordinary purposes such a 
curve should not be carried beyond the second or third 
power of x. If carried to the second power there are three 
unknowns, and three normal equations must be solved 
simultaneously in securing the required values. 

The procedure is similar to that outlined for the linear 
case. Each observation equation is multiplied by the co- 
efficient of the first unknown in that equation, and the 
resulting equations are totaled to give the first normal 




equation. The process is repeated for the two other un- 
knowns, and the three normal equations thus obtained are 
solved for a, b and c. The results are the most probable 
values of these three constants. The following are the 
general forms which the three normal equations take: 


$$\Sigma(y) = na + b\Sigma(x) + c\Sigma(x^2)$$
$$\Sigma(xy) = a\Sigma(x) + b\Sigma(x^2) + c\Sigma(x^3)$$
$$\Sigma(x^2y) = a\Sigma(x^2) + b\Sigma(x^3) + c\Sigma(x^4)$$


As an example of the process, the calculations involved in
fitting a second degree parabola to the points 1, 2; 2, 6;
3, 7; 4, 8; 5, 10; 6, 11; 7, 11; 8, 10; 9, 9 may be outlined.
It is of the greatest practical importance in curve fitting, 
as in all extensive calculations, that the work be laid out 
and carried on in a definite and systematic fashion, with 
each step definitely related to the preceding and succeeding 
operations. Checks should be introduced wherever possi- 
ble, as mathematical errors creep into even the most careful 
work. A tabular arrangement is generally advisable, each 
operation being revealed and each set of results clearly 
presented. 

The data in the present case may be arranged in the 
following form: 


TABLE 68
Computation of Values Required in Fitting a Second Degree Parabola

   x      y     $xy$   $x^2$   $x^2y$
   1      2       2       1        2
   2      6      12       4       24
   3      7      21       9       63
   4      8      32      16      128
   5     10      50      25      250
   6     11      66      36      396
   7     11      77      49      539
   8     10      80      64      640
   9      9      81      81      729
  --     --     ---     ---    -----
  45     74     421     285    2,771

$n = 9$
$\Sigma(x) = 45$
$\Sigma(x^2) = 285$
$\Sigma(x^3) = 2{,}025$
$\Sigma(x^4) = 15{,}333$
$\Sigma(y) = 74$
$\Sigma(xy) = 421$
$\Sigma(x^2y) = 2{,}771$




When the x's are consecutive integers beginning with 1,
as in the present case, the values of $\Sigma(x)$, $\Sigma(x^2)$, $\Sigma(x^3)$
and $\Sigma(x^4)$ may be secured from prepared tables.

Substituting these values in the equations given above, 
the following normal equations are secured: 

$$74 = 9a + 45b + 285c$$
$$421 = 45a + 285b + 2{,}025c$$
$$2{,}771 = 285a + 2{,}025b + 15{,}333c$$


When these equations are solved simultaneously the 
following values are secured for the three constants: 


$$a = -.929$$
$$b = +3.523$$
$$c = -.267$$


The equation of the desired curve is
$$y = -.929 + 3.523x - .267x^2$$
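The calculation just outlined can be sketched in code. The following is a minimal illustration, not a prescribed routine: it forms the power sums for the nine points, builds the three normal equations, and solves them by ordinary elimination. The function names `solve_linear` and `fit_parabola` are hypothetical, chosen only for this sketch.

```python
# Sketch: least-squares fit of y = a + b*x + c*x**2 through the normal equations.
# solve_linear and fit_parabola are hypothetical helper names.

def solve_linear(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def fit_parabola(xs, ys):
    """Return (a, b, c) from the three normal equations."""
    # sx[p] is the sum of x**p (sx[0] equals n); sxy[p] is the sum of x**p * y.
    sx = [sum(x ** p for x in xs) for p in range(5)]
    sxy = [sum((x ** p) * y for x, y in zip(xs, ys)) for p in range(3)]
    matrix = [[sx[i + j] for j in range(3)] for i in range(3)]
    return solve_linear(matrix, sxy)

xs = list(range(1, 10))
ys = [2, 6, 7, 8, 10, 11, 11, 10, 9]
a, b, c = fit_parabola(xs, ys)
print(round(a, 3), round(b, 3), round(c, 3))
```

The sums formed inside `fit_parabola` are the same quantities tabulated in Table 68, so the intermediate values may be checked against that table.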


This curve and the nine given points are plotted in Fig. 56. 
If the values of x are consecutive, as in the present
example, the work of computation is lightened if the mid-value
is taken as origin. In this case $\Sigma(x)$ and $\Sigma(x^3)$ are
equal to zero, and the normal equations become
$$\Sigma(y) = na + c\Sigma(x^2)$$
$$\Sigma(xy) = b\Sigma(x^2)$$
$$\Sigma(x^2y) = a\Sigma(x^2) + c\Sigma(x^4)$$
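The saving in labor from shifting the origin may be seen in a short sketch applied to the same nine points. The variable names are illustrative only. With the mid-value (x = 5) as origin the odd power sums vanish, b is found from a single equation, and a and c from a pair of equations.

```python
# Sketch: the nine points of Table 68 with the origin moved to the mid-value.
xs = list(range(1, 10))
ys = [2, 6, 7, 8, 10, 11, 11, 10, 9]

mid = xs[len(xs) // 2]            # mid-value, here 5 (odd number of consecutive x's)
us = [x - mid for x in xs]        # deviations from the new origin: -4 ... +4

n = len(us)
s2 = sum(u ** 2 for u in us)      # sum of u^2
s4 = sum(u ** 4 for u in us)      # sum of u^4
sy = sum(ys)
suy = sum(u * y for u, y in zip(us, ys))
su2y = sum(u * u * y for u, y in zip(us, ys))

# The second normal equation involves b alone.
b = suy / s2

# The first and third normal equations form a pair in a and c.
det = n * s4 - s2 * s2
a = (sy * s4 - s2 * su2y) / det
c = (n * su2y - s2 * sy) / det

# Constant term referred back to the original origin, for comparison.
a_original = a - b * mid + c * mid ** 2
print(round(a, 3), round(b, 3), round(c, 3), round(a_original, 3))
```

The coefficient c is unchanged by the shift of origin, while a and b refer to the new origin; converting back reproduces the constant term found earlier.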


When a third degree parabola of the form $y = a + bx + cx^2 + dx^3$
is to be fitted to data, four constants must be
determined, and four normal equations are necessary.¹
These are of the following form:

$$\Sigma(y) = na + b\Sigma(x) + c\Sigma(x^2) + d\Sigma(x^3)$$
$$\Sigma(xy) = a\Sigma(x) + b\Sigma(x^2) + c\Sigma(x^3) + d\Sigma(x^4)$$
$$\Sigma(x^2y) = a\Sigma(x^2) + b\Sigma(x^3) + c\Sigma(x^4) + d\Sigma(x^5)$$
$$\Sigma(x^3y) = a\Sigma(x^3) + b\Sigma(x^4) + c\Sigma(x^5) + d\Sigma(x^6)$$

1 Cf. Table XXVIII, Pearson, Tables for Statisticians and Biometricians. 
Cambridge University Press, 1914. 


The solution for four or more constants involves a considerable
amount of arithmetical calculation, and there is
some question as to the advisability of representing secular
trend by equations of this type. With a sufficient number
of constants a curve may be fitted which will follow every
variation in the data, but such a curve could hardly be taken
to represent the long time trend.¹ Minor departures from
a simple and uniform trend, linear or otherwise, are to be
expected with economic data, but, if a real trend exists,
extreme departures from a fairly simple form are rare. If
such departures are due to pronounced changes in conditions
no single line of trend is likely to be satisfactory, and it is
advisable to break the period up into parts, with a separate
line of trend for each part. “Empirical curves,” says
Steinmetz, “can be represented by a single equation only
when the physical conditions remain constant within the
range of the observations.” Though this statement relates
to the fitting of curves to data from the physical sciences,
it applies equally well to economic data.


Fig. 56. Illustrating the Fitting of a Second Degree Parabola to Nine Points


1 Regarding the employment of potential series of the type indicated for
representing empirical curves, Steinmetz states that their use is justified:

1. If the successive coefficients a, b, c . . . decrease in value so rapidly that
within the range of observation the higher terms become rapidly smaller
and appear as mere secondary terms.

2. If the successive coefficients follow a definite law, indicating a convergent
series which represents some other function, as an exponential, trigonometric,
etc.

3. If all the coefficients are very small, with the exception of a few of them, and
only the latter ones thus need to be considered. Cf. Steinmetz, Engineering
Mathematics, 214-215.


A TYPICAL PROBLEM: DETERMINING THE SECULAR
TREND OF BUSINESS FAILURES


The procedure of fitting certain types of curves to simple 
data has been illustrated in the preceding sections. Before 
proceeding to a discussion of slightly different forms, it 
may be helpful to insert a concrete example at this point. 
A straight line, a second degree parabola and a third degree 
parabola are to be fitted to the figures for business failures 
in the United States, from 1897 to 1921. The three curves 
are to be fitted in order that the results may be compared 
and significant differences noted. 

To facilitate the calculations, the mid-year of the period, 
1909, is taken as the origin. Certain of the values needed 
in the normal equations are computed in the following 
table. The values of x represent the time factor, while
the values of y are the corresponding numbers of business
failures:


TABLE 69
Business Failures in the United States, 1897-1921
Computation of Values Required in Fitting Lines of Trend

 (1)     (2)      (3)           (4)            (5)              (6)
 Year     x        y            $xy$          $x^2y$           $x^3y$
               (No. of
               failures)
 1897   -12     13,083      -156,996      1,883,952      -22,607,424
 1898   -11     11,615      -127,765      1,405,415      -15,459,565
 1899   -10      9,642       -96,420        964,200       -9,642,000
 1900    -9      9,912       -89,208        802,872       -7,225,848
 1901    -8     10,648       -85,184        681,472       -5,451,776
 1902    -7      9,973       -69,811        488,677       -3,420,739
 1903    -6      9,775       -58,650        351,900       -2,111,400
 1904    -5     10,417       -52,085        260,425       -1,302,125
 1905    -4      9,967       -39,868        159,472         -637,888
 1906    -3      9,385       -28,155         84,465         -253,395
 1907    -2     10,274       -20,548         41,096          -82,192
 1908    -1     14,066       -14,066         14,066          -14,066
 1909     0     11,872          ....           ....             ....
 1910     1     11,588        11,588         11,588           11,588
 1911     2     12,679        25,358         50,716          101,432
 1912     3     13,832        41,496        124,488          373,464
 1913     4     14,553        58,212        232,848          931,392
 1914     5     16,780        83,900        419,500        2,097,500
 1915     6     19,035       114,210        685,260        4,111,560
 1916     7     16,498       115,486        808,402        5,658,814
 1917     8     13,073       104,584        836,672        6,693,376
 1918     9      9,331        83,979        755,811        6,802,299
 1919    10      5,515        55,150        551,500        5,515,000
 1920    11      8,463        93,093      1,024,023       11,264,253
 1921    12     19,982       239,784      2,877,408       34,528,896
 Totals   0    301,958      +188,084     15,516,228       +9,881,156

$n = 25$
$\Sigma(x) = 0$
$\Sigma(y) = 301{,}958$
$\Sigma(xy) = 188{,}084$
$\Sigma(x^2) = 1{,}300$
$\Sigma(x^2y) = 15{,}516{,}228$
$\Sigma(x^3) = 0$
$\Sigma(x^3y) = 9{,}881{,}156$
$\Sigma(x^4) = 121{,}420$
$\Sigma(x^5) = 0$
$\Sigma(x^6) = 13{,}471{,}900$




FITTING A STRAIGHT LINE TO BUSINESS
FAILURES
Since the origin is at the mid-point of the x's, the equations
to be solved in securing the values of the required
constants are of the form
$$\Sigma(y) = na$$
$$\Sigma(xy) = b\Sigma(x^2)$$
Inserting the given values in the formulas, we have
$$301{,}958 = 25a$$
$$188{,}084 = 1{,}300b$$
from which
$$a = 12{,}078$$
$$b = 144.7$$


The equation of the line of best fit is, therefore,
$$y = 12{,}078 + 144.7x$$


FITTING A SECOND DEGREE PARABOLA
For a second degree parabola the normal equations to be
solved are of the form
$$\Sigma(y) = na + c\Sigma(x^2)$$
$$\Sigma(xy) = b\Sigma(x^2)$$
$$\Sigma(x^2y) = a\Sigma(x^2) + c\Sigma(x^4)$$
When the appropriate values are substituted, the following
equations are secured:

$$301{,}958 = 25a + 1{,}300c$$
$$188{,}084 = 1{,}300b$$
$$15{,}516{,}228 = 1{,}300a + 121{,}420c$$


Solving for the constants,
$$a = +12{,}258$$
$$b = +144.7$$
$$c = -3.45$$


The required equation is:
$$y = 12{,}258 + 144.7x - 3.45x^2$$


FITTING A THIRD DEGREE PARABOLA
The normal equations to be solved in securing the values
of the constants for a third degree parabola, when the
origin is at the mid-point of the x's, are of the form


$$\Sigma(y) = na + c\Sigma(x^2)$$
$$\Sigma(xy) = b\Sigma(x^2) + d\Sigma(x^4)$$
$$\Sigma(x^2y) = a\Sigma(x^2) + c\Sigma(x^4)$$
$$\Sigma(x^3y) = b\Sigma(x^4) + d\Sigma(x^6)$$


The equations to be solved simultaneously are:


$$301{,}958 = 25a + 1{,}300c$$
$$188{,}084 = 1{,}300b + 121{,}420d$$
$$15{,}516{,}228 = 1{,}300a + 121{,}420c$$
$$9{,}881{,}156 = 121{,}420b + 13{,}471{,}900d$$


The values derived from these equations are:


$$a = +12{,}258$$
$$b = +482$$
$$c = -3.45$$
$$d = -3.61$$


The required equation is:


$$y = 12{,}258 + 482x - 3.45x^2 - 3.61x^3$$
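As a check on the three sets of constants, the normal equations may be solved mechanically from the sums of Table 69. The sketch below is illustrative only; `solve_2x2` is a hypothetical helper. It relies on the fact that, with the origin at the mid-point, the equations separate into one pair involving a and c and another involving b and d.

```python
# Sketch: constants of the three lines of trend from the sums of Table 69.
n = 25
s2, s4, s6 = 1_300, 121_420, 13_471_900          # sums of x^2, x^4, x^6
sy, sxy, sx2y, sx3y = 301_958, 188_084, 15_516_228, 9_881_156

def solve_2x2(a11, a12, a21, a22, r1, r2):
    """Solve a11*u + a12*v = r1, a21*u + a22*v = r2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return (r1 * a22 - a12 * r2) / det, (a11 * r2 - a21 * r1) / det

# Straight line: each normal equation contains a single unknown.
a1, b1 = sy / n, sxy / s2

# Second degree parabola: b as for the straight line; a and c from one pair.
a2, c2 = solve_2x2(n, s2, s2, s4, sy, sx2y)
b2 = sxy / s2

# Third degree parabola: a and c as above; b and d from the other pair.
a3, c3 = a2, c2
b3, d3 = solve_2x2(s2, s4, s4, s6, sxy, sx3y)

print(round(a1), round(b1, 1))
print(round(a2), round(b2, 1), round(c2, 2))
print(round(a3), round(b3), round(c3, 2), round(d3, 2))
```

Passing from the second to the third degree leaves a and c unaltered but changes b materially, since b and d are determined jointly.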


The original data and the lines of trend given by these 
three equations are plotted in Fig. 57. The actual values 
of y (business failures), the computed or normal values in 
each of the three cases and the percentage deviations of 
actual from normal in each case are shown in the following 
table: 




TABLE 70
Business Failures in the United States, 1897-1921

Actual Values, Normal Values as derived from three lines of trend, and
Percentage Deviations of Actual from Normal

                         Normal values                Percentage deviations from
 Year   Actual    Straight    Second     Third     Straight    Second     Third
        values    line        degree     degree    line        degree     degree
                              parabola   parabola              parabola   parabola
 1897   13,083    10,342      10,025     12,215     +26.5       +30.5      +7.0
 1898   11,615    10,486      10,249     11,344     +10.8       +13.2      +2.3
 1899    9,642    10,631      10,466     10,703      -9.3        -8.0     -10.0
 1900    9,912    10,776      10,676     10,272      -8.0        -7.0      -3.8
 1901   10,648    10,920      10,880     10,030      -2.5        -2.0      +6.5
 1902    9,973    11,065      11,076      9,953      -9.9       -10.0      +0.2
 1903    9,775    11,210      11,266     10,022     -12.8       -13.2      -2.5
 1904   10,417    11,354      11,448     10,213      -8.3        -9.0      +2.1
 1905    9,967    11,499      11,624     10,506     -13.3       -14.3      -5.1
 1906    9,385    11,644      11,793     10,878     -19.4       -20.4     -13.7
 1907   10,274    11,789      11,955     11,313     -12.9       -14.1      -9.2
 1908   14,066    11,933      12,110     11,776     +17.8       +16.0     +19.4
 1909   11,872    12,078      12,258     12,258      -1.7        -3.2      -3.2
 1910   11,588    12,223      12,400     12,733      -5.2        -6.6      -9.1
 1911   12,679    12,367      12,533     13,179      +2.5        +1.0      -3.8
 1912   13,832    12,512      12,660     13,575     +10.5        +9.2      +1.9
 1913   14,553    12,657      12,781     13,900     +15.0       +13.9      +4.6
 1914   16,780    12,801      12,895     14,131     +31.0       +30.0     +18.6
 1915   19,035    12,946      13,002     14,246     +47.0       +46.2     +33.5
 1916   16,498    13,091      13,102     14,225     +26.0       +25.9     +15.9
 1917   13,073    13,236      13,194     14,045      -1.2        -0.9      -7.0
 1918    9,331    13,380      13,281     13,685     -30.2       -29.8     -31.8
 1919    5,515    13,525      13,360     13,123     -59.1       -58.7     -57.8
 1920    8,463    13,670      13,432     12,338     -38.0       -37.0     -31.4
 1921   19,982    13,814      13,498     11,307     +44.6       +48.0     +76.8
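The entries of such a table follow from two simple operations, sketched below for the straight line and the year 1897. The function names are hypothetical, and the constants are the rounded ones printed above.

```python
# Sketch: one normal value and one percentage deviation for Table 70.

def straight_line_normal(year, origin=1909, a=12_078, b=144.7):
    """Ordinate of the straight line of trend, with x counted from the origin year."""
    return a + b * (year - origin)

def percentage_deviation(actual, normal):
    """Deviation of actual from normal, expressed as a percentage of normal."""
    return 100.0 * (actual - normal) / normal

normal_1897 = straight_line_normal(1897)          # 12,078 - 12 * 144.7
deviation_1897 = percentage_deviation(13_083, normal_1897)
print(round(normal_1897), round(deviation_1897, 1))
```

The deviation is expressed as a percentage of the normal value, not of the actual value.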


COMPARISON OF LINES OF TREND


It is clear from the preceding discussion, and from 
Fig. 57, that widely different results may be secured when 
the secular trend in a given case is represented by different 
types of curves. In the present instance the straight line 
and the second degree parabola follow much the same course, 
but the third degree parabola marks out quite a different 
trend. The example was selected for the purpose of emphasizing
these differences. Since a "normal" for each year
is measured by the ordinate of the line of trend for that
year, the use of different lines of trend may give quite
different standards of normality. Moreover, the cycles
registered by the deviations of the actual data from the
line of trend may vary with the type of curve selected.
That this is true in the present case is shown by the percentage
deviation figures in Table 70.


Fig. 57. Business Failures in the United States, 1897-1921, with Three Lines
of Trend
(Vertical axis: Number of Failures. Curves shown: a straight line, a second
degree parabola and a third degree parabola, each fitted to the original data.)


The deviations from the straight line and the second 
degree parabola show a close correspondence, but the 
deviations from the third degree parabola are markedly 
different. In four years the signs of the deviations from 
the third degree parabola are different from those of the 
deviations from the other curves, and in many of the other 
years the magnitudes of the deviations vary widely. An 
extension of this third degree curve beyond the limits of 
the data would give quite illogical results, however. While 
projection in any case is risky, the straight line and the 




second degree parabola constitute measures of trend which 
are better for this purpose than the third degree parabola. 

More will be said in a later section as to the factors which 
determine the choice of a curve. With respect to the selec- 
tion of a curve in the present instance several points should 
be noted. If the actual data fall consistently above or 
below a line of trend for a considerable period, it is probable 
that the fit is not good. For business failures both the 
straight line and the second degree parabola fail to follow 
the actual data closely, there being a single nine-year 
period from 1899 to 1907 during which the data are con- 
tinually below normal, as represented by these two lines 
of trend. The third degree parabola follows the data much 
more closely, and, so far as one may judge by inspection, 
gives a better fit within the limits of the data. 
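The test by inspection described here, namely whether the data lie on one side of the line for a long unbroken period, can be made mechanical. The sketch below is one possible way of doing so and is not part of the original procedure; `longest_run` is a hypothetical helper, and a value exactly on the trend is counted as "below" purely for simplicity.

```python
# Sketch: longest unbroken run of years on one side of a line of trend.
years = list(range(1897, 1922))
actual = [13083, 11615, 9642, 9912, 10648, 9973, 9775, 10417, 9967, 9385,
          10274, 14066, 11872, 11588, 12679, 13832, 14553, 16780, 19035,
          16498, 13073, 9331, 5515, 8463, 19982]

def line_trend(year):
    return 12_078 + 144.7 * (year - 1909)

def cubic_trend(year):
    x = year - 1909
    return 12_258 + 482 * x - 3.45 * x ** 2 - 3.61 * x ** 3

def longest_run(years, actual, trend):
    """Return (length, first year, last year, side) of the longest one-sided run."""
    best = (0, None, None, None)
    run_start, run_side = 0, None
    for i, (year, y) in enumerate(zip(years, actual)):
        side = "above" if y > trend(year) else "below"
        if side != run_side:
            run_start, run_side = i, side
        length = i - run_start + 1
        if length > best[0]:
            best = (length, years[run_start], year, side)
    return best

print(longest_run(years, actual, line_trend))
print(longest_run(years, actual, cubic_trend))
```

For the straight line the longest run is the nine years 1899 to 1907 noted in the text; for the third degree parabola the longest run is shorter.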


THE USE OF LOGARITHMS IN CURVE FITTING


The family of curves described above represents a simple 
and very useful type. Perhaps of even greater general 
utility, in the analysis of time series, are curves of a semi- 
logarithmic type. The advantages of plotting many series 
of data on semi-logarithmic or “‘ratio”’ paper were explained 
in an earlier section. A fundamental virtue of this type of 
plotting is that it presents a true picture of relative varia- 
tions, of ratios between magnitudes. Relations of this type 
are ordinarily of primary interest in the analysis of economic 
data, and therefore the determination of trends should
proceed on the same basis.

In doing so, we can make use of a group of curves of the 
same general form as those already described, the one 
difference being that log y takes the place of y throughout. 
That is, the straight line form is $\log y = a + bx$, while the
general form for the potential series is
$\log y = a + bx + cx^2 + dx^3 + \dots$ The curves secured may be
constructed on arithmetic paper, plotting the natural x's and
the logarithms of the y's, or natural values of both x's and y's
may be plotted on semi-logarithmic paper, the logarithmic scale
extending along the y-axis. The latter is the simpler method.
To illustrate the procedure, the steps involved in fitting a
curve of the type $\log y = a + bx + cx^2$ will be shown. The
trend of petroleum production in the United States from
1908 to 1922 is to be determined. The values needed in the
normal equations are derived from the following table:


TABLE 71
Petroleum Production in the United States, 1908-1922
Computation of Values required in fitting Line of Trend

 Year     x        y           log y     $x \cdot \log y$   $x^2 \cdot \log y$
              Production (in
              millions of bbls.)
 1908    -7      178.5       2.25164       -15.76148          110.33036
 1909    -6      183.2       2.26269       -13.57614           81.45684
 1910    -5      209.6       2.32139       -11.60695           58.03475
 1911    -4      220.4       2.34321        -9.37284           37.49136
 1912    -3      222.9       2.34811        -7.04433           21.13299
 1913    -2      248.4       2.39515        -4.79030            9.58060
 1914    -1      265.8       2.42456        -2.42456            2.42456
 1915     0      281.1       2.44886          ....               ....
 1916     1      300.8       2.47828         2.47828            2.47828
 1917     2      335.3       2.52543         5.05086           10.10172
 1918     3      355.9       2.55133         7.65399           22.96197
 1919     4      378.4       2.57795        10.31180           41.24720
 1920     5      442.9       2.64631        13.23155           66.15775
 1921     6      472.2       2.67413        16.04478           96.26868
 1922     7      557.5       2.74624        19.22368          134.56576

                            36.99528         9.41834          694.23282

$n = 15$
$\Sigma(\log y) = 36.99528$
$\Sigma(x \cdot \log y) = 9.41834$
$\Sigma(x^2 \cdot \log y) = 694.23282$
$\Sigma(x) = 0$
$\Sigma(x^2) = 280$
$\Sigma(x^3) = 0$
$\Sigma(x^4) = 9{,}352$


The three normal equations to be solved are of the form
$$\Sigma(\log y) = na + b\Sigma(x) + c\Sigma(x^2)$$
$$\Sigma(x \cdot \log y) = a\Sigma(x) + b\Sigma(x^2) + c\Sigma(x^3)$$
$$\Sigma(x^2 \cdot \log y) = a\Sigma(x^2) + b\Sigma(x^3) + c\Sigma(x^4)$$


Substituting the given values we have
$$36.99528 = 15a + 280c$$
$$9.41834 = 280b$$
$$694.23282 = 280a + 9{,}352c$$


Solving for the constants,
$$a = +2.450508$$
$$b = +.033637$$
$$c = +.0008488$$


The equation to the desired curve is, therefore,
$$\log y = 2.450508 + .033637x + .0008488x^2$$
with origin at 1915.

The substitution in this equation of the value of zx repre- 
senting any given year will enable the logarithm of the 
trend or normal value to be calculated. The trend value 
in natural numbers may then be determined. In the table 
which follows the normal value for each of the years covered
is given, together with the percentage relation of actual to
normal.


TABLE 72

Trend of Petroleum Production in the United States with Comparison
of Actual and Trend Values

(Line of Trend fitted to Logarithms of Production Figures.)

 Year     x      y (actual)        Log of       y (computed)      Percentage
                 Production (in    Trend        Trend value (in   relation of
                 millions of                    millions of       actual to
                 bbls.)                         bbls.)            trend
 1908    -7        178.5         2.256639         180.6             98.8
 1909    -6        183.2         2.279242         190.2             96.3
 1910    -5        209.6         2.303543         201.2            104.2
 1911    -4        220.4         2.329540         213.6            103.2
 1912    -3        222.9         2.357236         227.6             97.9
 1913    -2        248.4         2.386629         243.6            102.0
 1914    -1        265.8         2.417720         261.6            101.6
 1915     0        281.1         2.450508         282.2             99.6
 1916     1        300.8         2.484994         305.5             98.5
 1917     2        335.3         2.521177         332.0            101.0
 1918     3        355.9         2.559058         362.3             98.2
 1919     4        378.4         2.598636         396.9             95.3
 1920     5        442.9         2.639913         436.4            101.5
 1921     6        472.2         2.682886         481.8             98.0
 1922     7        557.5         2.727557         534.0            104.4
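The passage from the equation to the tabulated trend values may be sketched as follows. This is an illustration only, using the constants exactly as printed above and common (base 10) logarithms; the function names are hypothetical.

```python
# Sketch: trend values in natural numbers from the logarithmic equation.

def log_trend(x, a=2.450508, b=0.033637, c=0.0008488):
    """Common logarithm of the trend value, x being counted from 1915."""
    return a + b * x + c * x * x

def trend_value(year, origin=1915):
    """Antilogarithm of the computed log: trend in millions of barrels."""
    return 10 ** log_trend(year - origin)

for year, actual in [(1915, 281.1), (1922, 557.5)]:
    trend = trend_value(year)
    relation = 100 * actual / trend       # percentage relation of actual to trend
    print(year, round(log_trend(year - 1915), 6), round(trend, 1), round(relation, 1))
```

Because the curve is fitted to logarithms, equal vertical distances from it correspond to equal percentage departures of actual from trend, not to equal absolute departures.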




The points representing the actual production, together 
with the line of trend, are plotted in Fig. 58. The graph of 
the derived equation gives a very good representation of 
the trend in the present instance. 

Fig. 58. Production of Crude Petroleum in the United States, 1908-1922,
with Line of Trend Fitted to Logarithms
(Curves shown: actual production and line of trend.)

The advantages of this type of curve are evident when
the above results are compared with those secured when a
similar curve (a second degree parabola) is fitted to the
natural numbers. The values of the constants are secured
by the method of least squares, and the following equation
is derived:
$$y = 279.387 + 24.262x + 1.650x^2$$
with origin at 1915. The trend values secured from this
equation, together with the actual production figures and
the percentage relations between them, are given below.




TABLE 73

Trend of Petroleum Production in the United States with Comparison
of Actual and Trend Values

(Line of Trend fitted to Natural Numbers.)

 Year     x      y (actual)       y (computed)     Percentage
                 Production       Trend value      relation
                 (in millions     (in millions     of actual
                 of bbls.)        of bbls.)        to trend
 1908    -7        178.5           190.403           93.7
 1909    -6        183.2           193.215           94.8
 1910    -5        209.6           199.327          105.2
 1911    -4        220.4           208.739          105.6
 1912    -3        222.9           221.451          100.7
 1913    -2        248.4           237.463          104.6
 1914    -1        265.8           256.775          103.5
 1915     0        281.1           279.387          100.6
 1916     1        300.8           305.299           98.5
 1917     2        335.3           334.511          100.2
 1918     3        355.9           367.023           97.0
 1919     4        378.4           402.835           93.9
 1920     5        442.9           441.947          100.2
 1921     6        472.2           484.359           97.5
 1922     7        557.5           530.071          105.2


This curve, together with the actual production figures, 
is plotted in Fig. 59. 

Each of the equations used to represent the trend of 
petroleum production in this example contains three con- 
stants, so comparison of the results is legitimate. Inspection 
of the two graphs would lead to the conclusion that the curve 
fitted to the logarithms of the data gives a better representa- 
tion of the trend than the curve fitted to the natural num- 
bers, but an accurate measure is preferable to such a general 
impression. Such a measure is afforded by the root-mean- 
square deviation. This may be computed exactly as in 
dealing with a frequency distribution, except that devia- 
tions are measured from the line of trend instead of from 
the arithmetic mean. 

In the case of the curve fitted to the natural numbers,
the root-mean-square (standard) deviation is found to be
12.46, while for the curve fitted to logarithms of the production
figures this measure has a value of 9.44. (The
unit in each case is one million barrels of petroleum.) It
is clear that the dispersion about the latter curve is distinctly
less than about the curve based on natural numbers;
the logarithmic curve is by far the better measure of trend
in the present instance.¹

Fig. 59. Production of Crude Petroleum in the United States, 1908-1922, with
Line of Trend Fitted to Natural Numbers
(Vertical axis: Millions of Barrels.)


1 The standard deviations given above were computed from the actual devia- 
tions, in millions of barrels of petroleum. The superiority of the logarithmic 
curve is even more pronounced, as would be expected, if these measures are com- 
puted from percentage deviations. The standard deviation about the curve fitted 
to the natural numbers is 4.01 on this basis, while the standard deviation about 
the logarithmic curve is but 2.69. 

It should be noted that the testing of lines of trend by comparing standard 
deviations is only valid when the equations to the curves being compared contain 
the same number of constants. A curve may be made to pass through every 
point, if a number of constants be introduced equal to the number of points plotted. 
The standard deviation in this case would be zero, but such a curve would not be 
in any sense a measure of trend. 
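The root-mean-square deviations quoted for the two petroleum curves can be approximated from the printed equations. The sketch below is illustrative; the function names are hypothetical, and small differences from the published figures may arise because the trend values are here left unrounded.

```python
# Sketch: root-mean-square deviation of the data about each line of trend.
import math

production = [178.5, 183.2, 209.6, 220.4, 222.9, 248.4, 265.8, 281.1,
              300.8, 335.3, 355.9, 378.4, 442.9, 472.2, 557.5]
xs = list(range(-7, 8))                   # years 1908-1922, origin at 1915

def natural_trend(x):
    """Second degree parabola fitted to the natural numbers."""
    return 279.387 + 24.262 * x + 1.650 * x * x

def logarithmic_trend(x):
    """Second degree parabola fitted to the logarithms, in natural numbers."""
    return 10 ** (2.450508 + 0.033637 * x + 0.0008488 * x * x)

def rms_deviation(actual, trend, xs):
    """Square root of the mean squared deviation of actual from trend."""
    squares = [(y - trend(x)) ** 2 for x, y in zip(xs, actual)]
    return math.sqrt(sum(squares) / len(squares))

rms_natural = rms_deviation(production, natural_trend, xs)
rms_logarithmic = rms_deviation(production, logarithmic_trend, xs)
print(round(rms_natural, 2), round(rms_logarithmic, 2))
```

Both measures are in millions of barrels, and both curves contain three constants, so the comparison is of the kind the text describes as legitimate.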




The two families of curves described in the preceding 
sections meet most of the needs of the economic statistician. 
The trend in most time series may be described by curves 
of the potential series, fitted either to natural numbers or 
to the logarithms of the data (that is, to the logarithms
of the y values; time, the x-variable, is treated in terms of
natural numbers in fitting both the above types of curves).
These classes constitute flexible and widely applicable 
curve forms. Attention may be called to several other 
curve types which have been applied less extensively to 
time series, but with favorable results in particular cases. 

Curves of the ordinary parabolic type ($y = ax^b$) are not
generally applicable to economic data in the form of time 
series, as their use involves the treatment of the time 
variable as a geometric series. Such a curve, it will be 
recalled, becomes a straight line on double logarithmic 
paper. Yet if a curve of this form serves accurately to 
describe the trend of a given series, its use is justified, 
empirically. 

Such curves may be fitted most readily by employing
logarithms and using an equation of the linear type. The
equation
$$y = ax^b$$
becomes, in logarithmic form,
$$\log y = \log a + b \log x$$


The two normal equations needed in fitting such a curve are
of the form
$$\Sigma(\log y) = n \log a + b\Sigma(\log x)$$
$$\Sigma(\log x \cdot \log y) = \log a\,\Sigma(\log x) + b\Sigma(\log x)^2$$


By substituting the values computed from the data, these
equations may be solved for log a and b, just as in fitting
an ordinary straight line.¹

1 A very useful table of the sums of the logarithms of the natural numbers from
1 to 100 is included as an appendix to Medical Biometry and Statistics, by Raymond
Pearl. (Philadelphia, Saunders, 1923.)
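A curve of the form y = ax^b may be fitted by applying the straight line normal equations to the logarithms of both variables, as in the sketch below. The data are hypothetical, generated exactly from y = 3x² so that the result can be checked by eye; the function names are likewise illustrative.

```python
# Sketch: fitting y = a * x**b by a straight line in log x and log y.
import math

def fit_straight_line(us, vs):
    """Least-squares intercept and slope of v on u from the two normal equations."""
    n = len(us)
    su, sv = sum(us), sum(vs)
    suu = sum(u * u for u in us)
    suv = sum(u * v for u, v in zip(us, vs))
    slope = (n * suv - su * sv) / (n * suu - su * su)
    intercept = (sv - slope * su) / n
    return intercept, slope

def fit_power_curve(xs, ys):
    """Return (a, b) for y = a * x**b; x and y must all be positive."""
    log_a, b = fit_straight_line([math.log10(x) for x in xs],
                                 [math.log10(y) for y in ys])
    return 10 ** log_a, b

xs = [1, 2, 3, 4, 5]                      # hypothetical data, not from the text
ys = [3 * x ** 2 for x in xs]
a, b = fit_power_curve(xs, ys)
print(round(a, 4), round(b, 4))
```

The least squares condition is here applied to the logarithms, so the fitted curve minimizes squared deviations of log y, not of y itself.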




The simple exponential curve ($y = ab^x$) may be widely
used in the analysis of time series. In logarithmic form
this becomes

$$\log y = \log a + (\log b)x$$


The two normal equations to be solved in fitting such a
curve are
$$\Sigma(\log y) = n \log a + \log b\,\Sigma(x)$$
$$\Sigma(x \cdot \log y) = \log a\,\Sigma(x) + \log b\,\Sigma(x^2)$$


This is the straight line form of the family of semi-logarithmic
curves described in an earlier section. By determining
the antilogarithms of the two constants secured from the
solution of the normal equations, the derived equation may
be put in the exponential form.
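By way of illustration, the two normal equations may be applied to the petroleum sums of Table 71. This particular fit is not worked in the text and is offered only as a sketch; the variable names are hypothetical.

```python
# Sketch: simple exponential y = a * b**x from the sums of Table 71.
n = 15
sum_x, sum_x2 = 0, 280                    # origin at 1915, so the sum of x is zero
sum_log_y, sum_x_log_y = 36.99528, 9.41834

# Solve the two normal equations for log a and log b.
det = n * sum_x2 - sum_x * sum_x
log_a = (sum_log_y * sum_x2 - sum_x * sum_x_log_y) / det
log_b = (n * sum_x_log_y - sum_x * sum_log_y) / det

# Antilogarithms put the equation in exponential form.
a, b = 10 ** log_a, 10 ** log_b
print(round(log_a, 6), round(log_b, 6), round(a, 1), round(b, 4))
```

The constant b is the ratio of each year's trend value to that of the year before, so that b less one, multiplied by 100, gives the constant percentage rate of growth implied by the curve.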

Some use has been made, in the interpretation of economic 
statistics, of the Gompertz curve, the equation to which 
was originally developed in the actuarial field. The equa- 
tion is 

$$y = ab^{c^x}$$
Its use in the analysis of economic statistics has been based 
upon the argument that there is a general law of growth 
characteristic of population increase, and that this same 
type of growth is found in industries whose products are a 
direct function of the growth of population.! 

A somewhat similar curve of growth described by the
equation
$$y = d + \frac{b}{e^{-ax} + c}$$
has been employed by Raymond Pearl and Lowell J. Reed
in forecasting population growth.² This curve has been
found to describe the trend of certain economic data.

1 The characteristics of this curve are described in “‘Law of Growth in Fore- 
casting Demand,” by Raymond B. Prescott, Journal of the American Statistical 
Association, Dec. 1922, 471-79. 

2 “Predicted Growth of Population of New York and Its Environs.” R. Pearl 
and L. J. Reed. Committee of Plan of New York and Its Environs, 1923. 




SHORT CUT METHODS OF FITTING CURVES:
THE METHOD OF AVERAGES


The fitting of curves by the method of least squares is 
appropriate when a high degree of accuracy is required. 
Other methods of fitting curves may sometimes be em- 
ployed when only approximate values are required. 

In fitting a straight line by the method of averages the
series is divided into two equal or nearly equal parts. From
the observations falling in the two groups two equations of
the following form are secured:


$$\Sigma(y) = na + b\Sigma(x)$$


These equations are solved simultaneously, and the values
of a and b obtained. This is equivalent to finding, for each
half of the series, the point having coordinates equal to the
arithmetic averages of the x's and y's, and connecting the
two points thus secured. This line will not be identical,
ordinarily, with the line of least squares.

The method may be illustrated by using the data of 
business failures from Table 69. Since the series covers 
25 years, 13 observations may be included in the first group 
and 12 in the second. The two equations are 


$$140{,}629 = 13a - 78b$$
$$161{,}329 = 12a + 78b$$
Solving,
$$a = 12{,}078$$
$$b = 210$$


and the equation to the line of trend is
$$y = 12{,}078 + 210x$$
By the method of least squares the equation
$$y = 12{,}078 + 144.7x$$
was obtained. The latter, of course, is the equation to the
best fitting straight line. The short method in this case
gives an equation varying considerably from the correct
result. This may be expected when the original data depart
as widely from the line of trend as in the present
case.
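The method of averages, as applied to the business failures series, may be sketched thus. The variable names are hypothetical; the grouping into 13 and 12 observations follows the text.

```python
# Sketch: straight line by the method of averages, business failures 1897-1921.
xs = list(range(-12, 13))                 # origin at 1909
ys = [13083, 11615, 9642, 9912, 10648, 9973, 9775, 10417, 9967, 9385,
      10274, 14066, 11872, 11588, 12679, 13832, 14553, 16780, 19035,
      16498, 13073, 9331, 5515, 8463, 19982]

# One equation of the form sum(y) = n*a + b*sum(x) for each group.
n1, sx1, sy1 = 13, sum(xs[:13]), sum(ys[:13])
n2, sx2, sy2 = 12, sum(xs[13:]), sum(ys[13:])

# Solve the pair n1*a + sx1*b = sy1 and n2*a + sx2*b = sy2.
det = n1 * sx2 - n2 * sx1
a = (sy1 * sx2 - sy2 * sx1) / det
b = (n1 * sy2 - n2 * sy1) / det
print(sy1, sy2, round(a), round(b))
```

Only first-power sums enter the calculation, which accounts both for the brevity of the method and for its sensitiveness to extreme observations in either half of the series.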

THE METHOD OF SELECTED POINTS


Having determined the form of curve to be fitted to cer- 
tain data, approximate values of the constants may be 
secured by selecting a number of points equal to the number 
of constants required. The coordinates of these points are
substituted in the corresponding number of equations, and 
these are solved for the constants. Thus, if a third degree 
parabola is to be fitted to the data of business failures, four 
points must be selected. Drawing an experimental curve 
by inspection will assist in the selection of appropriate 
points. The following points may be taken as a basis for 
calculation (cf. Table 69 and Fig. 57 relating to business 
failures). 


    x         y
  -10      10,000
   -5      10,000
   +5      14,000
  +10      13,000


Substituting these values for x and y in equations of the form
desired, the following are secured:

$$10{,}000 = a - 10b + 100c - 1{,}000d$$
$$10{,}000 = a - 5b + 25c - 125d$$
$$14{,}000 = a + 5b + 25c + 125d$$
$$13{,}000 = a + 10b + 100c + 1{,}000d$$

Solving these equations simultaneously, the following values
are obtained for the four constants:


$$a = +12{,}166$$
$$b = +483$$
$$c = -6.7$$
$$d = -3.3$$


The required equation is, therefore,
$$y = 12{,}166 + 483x - 6.7x^2 - 3.3x^3$$




The difference between this equation and the one secured by 
the method of least squares will be noted. 
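The four simultaneous equations of the method of selected points may be solved mechanically. The sketch below uses a general elimination routine, `solve_linear`, a hypothetical helper introduced only for this illustration.

```python
# Sketch: third degree parabola through four selected points.

def solve_linear(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

points = [(-10, 10_000), (-5, 10_000), (5, 14_000), (10, 13_000)]

# Each point gives one equation y = a + b*x + c*x**2 + d*x**3.
matrix = [[x ** power for power in range(4)] for x, _ in points]
rhs = [y for _, y in points]
a, b, c, d = solve_linear(matrix, rhs)
print(round(a), round(b), round(c, 1), round(d, 1))
```

The curve so obtained passes exactly through the four chosen points, so the result depends entirely on how well those points were selected.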

There is a considerable margin of error in the employment 
of these short methods with ordinary economic data, be- 
cause of the fact that most economic series do not adhere 
closely to a given line of trend. That is, economic data 
correspond to physical measurements characterized by 
large errors of observation. When physical data follow a 
definite law, and the errors of observation are small, the 
use of the method of averages or the method of selected 
points will involve no material error. But with economic 
data, which frequently depart widely from any given line 
of trend, the use of short cut methods is in general in- 
advisable. 


THE SELECTION OF A CURVE TO REPRESENT TREND


Various types of curves which may be fitted to represent 
the trend of economic data over a period of time have been 
described. But which of these many types is to be selected 
in a given case? Which will give the best standard of 
normality for each of the years covered? Several references 
to this problem have been made in the preceding sections, 
but no general principles have been laid down. And, in 
fact, no general principles can be evoked to answer this 
fundamental question. There is no absolute test of goodness 
of fit in such cases. It is largely a matter of personal judg- 
ment as to the type of curve which best represents the 
trend in a given instance, and experience must play a 
dominant part in such judgments. But there are certain 
general considerations which are of assistance in selecting 
the appropriate type of curve, and which, in some cases, 
enable a single type to be selected with confidence that it is 
the best. 

1. The first step in the selection of a curve type is the
plotting of the data. When this has been done, it is frequently
possible by inspection to determine the appropriate
form. The data may be plotted in four different combinations,
of which the first two are of chief importance in
dealing with economic material.


a. Natural x, natural y. (That is, plot the given figures
on ordinary arithmetic paper.)

b. Natural x, log y. (Plot the x's on the natural scale,
and plot the y's on the logarithmic scale; i.e., use
semi-logarithmic paper.)

c. Natural y, log x. (Plot on semi-logarithmic paper,
with the x-scale logarithmic.)

d. Log y, log x. (Plot on double logarithmic paper.)


If in any of these cases a straight line trend is secured, a 
type of equation which plots as a straight line under the 
given conditions (cf. Chapter II) would be selected. If 
a linear equation cannot be secured some other simple 
type may be suggested by the plotted data. In studying 
such graphs for the purpose of selecting a curve to represent 
trend, one should be familiar with the curves representing 
all the simpler equations.¹

2. The appropriate curve may be determined by a study
of the relations between the two variables, x and y. In the
simplest cases the following relations hold:²

a. If, when the values of x are arranged in an arithmetic
series, the corresponding values of y form a geometric
series, the relation is of the exponential type,
described by the equation


$$y = ab^x$$


1 A variety of curves will be found plotted in the following: 
Empirical Formulas, T. R. Running, N. Y., Wiley, 1917. 
Engineering Mathematics, C. P. Steinmetz, N. Y., McGraw-Hill, 1917. 
Graphical and Mechanical Computation, Joseph Lipka, N. Y., Wiley, 1918. 
2 It will be recalled in connection with this discussion that an arithmetic series 
changes by a constant absolute increment, while a geometric series changes by a 
constant percentage. 




b. If, when the values of x are arranged in a geometric
series, the corresponding values of y form a geometric
series, the relation is of the simple parabolic
or hyperbolic type, described by the equation

$$y = ax^b$$

c. If, when the values of x are arranged in an arithmetic
series, the first differences of the corresponding y’s
are constant, the relation is of the straight line type,
described by the equation

$$y = a + bx$$


The differences between successive y values, when x’s are arranged in an arithmetic
series, are termed “first differences” or “differences of the first order” and are
represented by the symbol $\Delta y$. The differences between successive first differences
are called “second differences” and are represented by the symbol $\Delta^2 y$.
Differences of higher order are similarly derived. The following table illustrates
the formation of differences:


 x       y      Δy     Δ²y     Δ³y
 1      11
                29
 2      40              32
                61              12
 3     101              44
               105              12
 4     206              56
               161              12
 5     367              68
               229              12
 6     596              80
               309              12
 7     905              92
               401              12
 8    1306             104
               505              12
 9    1811             116
               621
10    2432


d. If, when the values of x are arranged in an arithmetic
series, the nth differences of the corresponding y’s
are constant, the relation between the variables
is described by an equation of the potential series
carried to the nth power of x; that is, by an equation
of the type

$$y = a + bx + cx^2 + dx^3 + \dots + qx^n$$

Thus, in the example given above, in which the third
differences are constant, the relation between x
and y would be described by an equation of the
form

$$y = a + bx + cx^2 + dx^3.$$
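The difference test lends itself to mechanical computation. The sketch below is illustrative only: the function names are invented, and the series is generated from the cubic y = 2x³ + 4x² + 3x + 2, which is an inference from the legible entries of the table above (it reproduces y = 11 at x = 1 and y = 2432 at x = 10) and is not stated in the text.

```python
def difference_table(y):
    """Return [y, first differences, second differences, ...] down to one value."""
    rows = [list(y)]
    while len(rows[-1]) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return rows

def constant_difference_order(y):
    """Lowest order at which all differences are equal, or None if there is none."""
    for order, row in enumerate(difference_table(y)):
        if order > 0 and len(row) > 1 and len(set(row)) == 1:
            return order
    return None

# Assumed generating equation (an inference, see the note above).
y = [2 * x**3 + 4 * x**2 + 3 * x + 2 for x in range(1, 11)]

order = constant_difference_order(y)
print(order, difference_table(y)[order])
```

A constant difference of order n points to a potential series carried to the nth power of x. With observed data the differences will be only approximately constant, so exact equality, as tested here, is too strict for practical use.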




These and similar tests, which may be applied to a great 
many other types of data, are described in the books by 
Running and Lipka referred to at the end of this chapter. 
It should be emphasized that, when one is selecting a curve 
to use in the analysis of economic data, he will rarely, if 
ever, find these tests to be met perfectly. This would 
happen only when the curve chosen passed through all the 
plotted points. But data in a given case will generally 
approximate some one of the conditions described above, 
and the appropriate type of curve will be indicated. 

3. If the study of the original data does not render a 
definite decision possible, several types of curves may be 
fitted to the data and the decision made by comparing the 
results, as was done in the two cases already cited, business 
failures and petroleum production. If the equations to 
the curves being compared contain the same number of 
constants, a comparison of the root-mean-square deviations 
about the curves furnishes a conclusive and valid test of 
the closeness of the fit within the limits of the data. 

The root-mean-square deviation may be readily computed
by making use of the following relationship:

$$\sum(d^2) = \sum(y^2) - a\sum(y) - b\sum(xy) - c\sum(x^2 y) - \dots$$

where $\sum(d^2)$ is the sum of the squares of the deviations about
the line of trend. (The derivation of this equation is explained
in Appendix A, in which a generalized form is given.)
If the equations do not contain an equal number of con- 
stants, a test of this sort is invalid and the comparison can 
only be made by inspection. Personal judgment as to the 
curve which represents the trend most accurately must be 
the basis of the decision in such cases. 
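The relationship for the sum of squared deviations may be checked numerically. The following sketch assumes a hypothetical series of nine yearly figures and a second degree parabola fitted by least squares with NumPy's `polyfit`; the data and variable names are illustrative and not drawn from the text.

```python
import numpy as np

# Hypothetical yearly figures, with x measured from the middle year.
x = np.arange(-4, 5, dtype=float)
y = np.array([12.0, 14.5, 13.8, 17.2, 19.0, 18.4, 22.1, 24.9, 24.0])

# Least-squares parabola y = a + b*x + c*x^2 (polyfit lists the highest power first).
c, b, a = np.polyfit(x, y, 2)

d = y - (a + b * x + c * x**2)       # deviations about the line of trend
direct = np.sum(d**2)                # sum of squares computed directly
shortcut = (np.sum(y**2) - a * np.sum(y)
            - b * np.sum(x * y) - c * np.sum(x**2 * y))

rms = np.sqrt(direct / len(y))       # root-mean-square deviation
print(direct, shortcut, rms)
```

The two sums agree only when the constants are those given by the method of least squares; for constants obtained in any other way the short formula does not hold.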

It should be remembered that the closeness of fit within 
the limits of the data is not by itself a final criterion. An 
equation could be secured, having a number of constants 
equal to the number of points, which would give a curve 
passing through every point plotted, yet such a curve
would not necessarily represent the trend. The concept
of a trend is of a regular, smooth underlying movement, 
from which there are deviations, but which marks the long 
time tendency of the series. In general, therefore, the curve 
should be of simple form, if it is to be consistent with the 
concept of secular trend. This does not mean, however, 
that a complex trend can be represented by a simple curve 
which fails to conform to the plotted data. 

4. An important question to be answered before the form 
of curve can be selected relates to the limits within which 
the line of trend is to be used. If it is to be used only 
within the limits of the plotted data (i.e., for interpolation)
one set of considerations governs the choice of a curve. If 
it is to be projected beyond the limits of the data, used as a 
basis for the determination of normal during a subsequent 
period, other considerations enter. In the former case a 
reasonable fit to the data is the sole requirement; in the 
latter case it is necessary, in addition, that the trend of the 
projection be logical, and consistent with the past record. 
This requirement was emphasized in discussing the trend 
of business failures. 

The fact should be clearly recognized that projection, or 
extrapolation, represents a guess, justified only on the 
assumption that a proper line of trend has been fitted and 
that the same conditions which affected the series in the 
past will prevail in the future. A change in conditions, the 
introduction of new elements, renders the projection invalid.
When dealing with economic statistics, moreover,
it is ordinarily impossible to tell, except in retrospect, when 
a change has taken place. Conclusions drawn from the 
projection of a line of trend are always subject to error, 
therefore. In practical statistical work such projections 
are made, and are justified on the ground that the most 
probable course in the future is that which prevailed in the 
past. Projections into the distant future are, of course, 
subject to wider margins of error than short-time projections.
Lines of trend should be revised from time to time, therefore,
as new data become available.

When a projection is to be made a simple curve with few 
constants is to be preferred to a more complicated one. A 
third or fourth degree parabola may give an excellent fit 
to the data in a given case, but the projection of such curves 
is inadvisable. It is well to remember, as Perrin has pointed 
out, that a curve suitable for interpolation may not be at 
all adapted to extrapolation. 
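The contrast between closeness of fit and fitness for projection may be sketched as follows. The series is hypothetical (a steady growth of 2 units a year with irregular deviations added), and the helper `rms_and_projection` is invented for the illustration.

```python
import numpy as np

x = np.arange(10, dtype=float)
# Hypothetical series: growth of 2 units a year plus irregular deviations.
y = 20 + 2 * x + np.array([1.5, -2.0, 0.5, 2.5, -1.0, -2.5, 3.0, 0.0, -1.5, 2.0])

def rms_and_projection(degree, ahead=5):
    """Fit a potential series of the given degree; return (rms, projected value)."""
    coeffs = np.polyfit(x, y, degree)
    rms = np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2))
    projected = np.polyval(coeffs, x[-1] + ahead)   # value 'ahead' years past the data
    return rms, projected

line_rms, line_proj = rms_and_projection(1)
quartic_rms, quartic_proj = rms_and_projection(4)
print(line_rms, quartic_rms)     # fit within the limits of the data
print(line_proj, quartic_proj)   # the two projections, for comparison
```

The curve of higher degree cannot fit less closely within the limits of the data, since the straight line is a special case of it. Its projection, however, is governed by the highest power of x and may depart widely from the course suggested by the record.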

It seems to be true, in general, that simple curves fitted 
to the logarithms of the y’s give more reliable results when 
projected than curves fitted to the natural numbers. In 
an interesting discussion of this point, Karl G. Karsten¹
argues that phenomena characterized by a uniform rate 
of change are more likely to maintain their trend than 
phenomena marked by a uniform amount of change. It is 
the semi-logarithmic curves, of course, which best measure 
rates of change. 
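A straight line fitted to the logarithms of the y’s corresponds to the exponential curve y = ab^x, and b less 1 is the uniform rate of change per period. The sketch below assumes a hypothetical series growing 5 per cent per period; the function name is invented for the illustration.

```python
import numpy as np

def fit_exponential_trend(x, y):
    """Fit log10(y) = A + B*x by least squares; return (a, b) of y = a * b**x."""
    B, A = np.polyfit(np.asarray(x, dtype=float),
                      np.log10(np.asarray(y, dtype=float)), 1)
    return 10.0 ** A, 10.0 ** B

# Hypothetical series increasing 5 per cent per period from a level of 200.
x = np.arange(8)
y = 200.0 * 1.05 ** x

a, b = fit_exponential_trend(x, y)
rate_per_period = b - 1.0    # uniform rate of change implied by the fitted line
print(a, rate_per_period)
```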

5. When the object of fitting a line of trend is the study 
of deviations connected with business cycles, another test 
of the accuracy of fit is afforded by a comparison of the 
deviation cycles with actual business conditions, as measured 
by a number of other series. In general, that line of trend 
which gives cycles consistent with the cycles in general 
business is to be preferred. The curves fitted to the data of 
business failures may be tested in this way. The broad 
cycles through which business moved during the period 
covered are reflected in the deviations from the straight 
line and the second degree parabola, but the minor cycles 
are not. These two lines of trend show business failures to 
have been below normal during the whole period from 1899 
to 1907. When the trend is represented by the third degree 
parabola, failures slightly above normal are found in 1901 
and in 1904. These represent phases of depression in minor 
cycles, though the whole period from 1899 to 1907 was one
of general prosperity. The curve of higher degree, it is
apparent, follows the data more closely, registering minor
deviations as cycles. Whether this is desirable in a
given case depends upon the nature of the study being
made.

1 Charts and Graphs, 423-425.

6. It is frequently true that no one curve will fit a given 
series during the entire period it is desired to study. This 
may be due to changes in conditions which cause the trend 
to be altered. Thus the trend of wholesale prices was 
downward, in a direction well represented by a straight line, 
from the close of the Civil War to 1896. From 1896 to the 
beginning of the World War the trend was upward, and could 
be described by a second degree parabola. Similar changes 
occur in many economic series. By breaking the entire 
period up into sections, appropriate lines of trend may be 
fitted to the several periods thus marked off. This process 
may be carried to a quite illogical extreme, however. The 
concept of trend is of a gradual, long-term change, and the 
breaking up of a series in order to fit a number of trend 
lines is contrary to the whole conception. It may be 
justified upon occasion when a real change in conditions 
occurs, but in all cases the attempt should be made to 
represent the trend during the whole period by a single line. 


REPRESENTATION OF TREND BY A RELATED 
STATISTICAL SERIES 


It is sometimes impossible to represent the trend of a
series by a mathematical curve. This is likely to be the 
case when data are available over but a short period, or 
when the trend of an old series has been quite clearly 
changed by the introduction of new factors. Such a problem 
is presented by commodity prices since the war. The pre- 
war trend has been definitely altered, and it is impossible 
to say with any degree of certainty what the future trend 
will be. Faced with this difficulty, Warren M. Persons has
developed a novel method of measuring cyclical fluctuations
by using one series as the base of a related series.

In the field of prices the two series employed are Brad- 
street’s index of wholesale prices and Persons’ price index 
of business cycles. The constitution of each of these 
indices was explained in an earlier section.² Bradstreet’s
index, based upon prices of 96 commodities, and Persons’ 
index, based upon prices of 10 commodities, move simul- 
taneously, in general, but the movements of the latter are 
much more violent. Accordingly, since Persons’ object is 
the measurement of cyclical fluctuations alone, Bradstreet’s 
index is used as a base for the more sensitive 10-com- 
modity index. Deviations are measured from this base 
precisely as from an ordinary line of trend. 

The same method is employed in securing a base line for 
money rates. A stable series, the yield on 10 prime rail- 
road bonds, is used as base for each of two more sensitive 
series which move concurrently, the rate on 60-90 day 
paper and the rate on 4-6 months paper. In each case a 
slight adjustment was necessary in order that the “crossing
points” (i.e., the points at which the line representing each
series crossed its base line) might coincide with the “crossing 
points” of other series used in the general index of business 
conditions. These adjustments are explained in the refer- 
ences cited. 


DEFLATION AS A STEP IN ANALYSIS 


Many series of economic data are expressed in monetary 
units, in dollars, pounds, or francs, and such series are 
subject to distortion because of changes in the price level. 
Thus the value of building and engineering contracts 
awarded in twenty-seven states in 1913 amounted to 858 
millions of dollars; in 1922 the value of contracts awarded 
in the same territory amounted to 3344 millions of dollars.³
Was the volume of building in 1922 almost four times that in
1913? It was not. The value of building contracts awarded
in any year depends not only upon the actual volume of
construction but also upon the costs of building materials
and building labor, and these increased materially from
1913 to 1922. If we wish to measure the change in the
volume of building alone, these values must be corrected
for the increase in building costs between 1913 and 1922.
Such a process is termed deflation.

1 Cf. Review of Economic Statistics, Harvard Economic Service, April, 1923,
73-74, July, 1923, 192-194.
2 Cf. Chapter VI.
3 Figures compiled by F. W. Dodge & Co.

The selection of an appropriate deflating index is the 
central problem in such cases. In the chapter on index 
numbers, money wages were deflated by a cost of living 
index to secure a measure of the changes in real wages. 
We might deflate the value of building contracts by an 
index of building material prices, at wholesale, in the 
following manner: 


 (1)           (2)                     (3)                  (4)
          Value of Building      Index of Building     Deflated Value of
 Year     Contracts              Material Prices       Building Contracts
          (millions of dollars)  (1913 = 100)          (millions of dollars)
 1913           858                    100                    858
 1922          3344                    168                   1990


The values in the last column are secured by dividing the 
actual values in column (2) by the corresponding index 
numbers in column (3), and multiplying by 100. This has 
the effect of reducing the actual value to the value in terms 
of prices during the base period, in the present instance 
1913. The actual volume of building is shown by the 
deflated figures to have increased in 1922 to slightly more
than double the 1913 volume.
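The arithmetic of deflation is simple enough to be stated in a few lines. The sketch below uses the two yearly figures given above; the function name `deflate` and the dictionary layout are merely illustrative.

```python
def deflate(value, index, base=100.0):
    """Express a money value in terms of base-period prices."""
    return value / index * base

contracts = {1913: 858.0, 1922: 3344.0}       # millions of dollars
material_index = {1913: 100.0, 1922: 168.0}   # 1913 = 100

deflated = {year: deflate(contracts[year], material_index[year])
            for year in contracts}
ratio = deflated[1922] / deflated[1913]       # change in physical volume

print(round(deflated[1922]), round(ratio, 2))
```

The ratio of the deflated figures, about 2.3, measures the change in volume, whereas the ratio of the actual values, nearly 3.9, reflects the change in volume and in costs together.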

Better results could be secured in this case by taking 
account of wage changes, as well as changes in material 
costs, in deflating. The American Telephone and Tele- 
graph Company uses such a combined index in deflating 
the value of building contracts awarded. The application 
of this index is illustrated in the following table: 


TABLE 74
Actual and Deflated Values of Building Contracts Awarded

              Value of Building    A. T. & T. Index of   Deflated Value of
              Contracts Awarded    Construction Costs    Building Contracts
Month         (millions of dollars)   (1914 = 100)       (millions of dollars)

1914
January       [illegible]           100           [illegible]
February             39.1           101                  38.7
March                58.9           102                  57.7
April                79.7           101                  78.9
May                  72.0           100                  72.0
June                 81.8           100                  81.8
July                 72.0           100                  72.0
August        [illegible]           100           [illegible]
September            47.1           100                  47.1
October              53.4            99                  53.9
November             45.5            99                  46.0
December             42.3            98                  43.2

1915
January              43.3            99                  43.7
February             48.8            99                  49.3
March                75.6            99                  76.4
April                16.5           100                  16.5
May                  77.1           102                  75.6
June                 92.2           104                  88.7
July                 94.7           104                  91.1
August               90.4           102                  88.6
September            82.0           103                  79.6
October              88.6           104                  85.2
November             88.0           107                  82.2
December             82.9           112                  74.0

1916
January              62.8           118                  53.2
February             66.3           121                  54.8
March                94.5           124                  76.2
April               100.9           127                  79.4
May                 131.4           127                 103.5
June                140.7           128                 109.9
July                114.4           125                  91.5
August              127.0           122                 104.1
September           132.2           123                 107.5
October             151.4           125                 121.1
November            122.4           128                  95.6
December            112.9           136                  83.0

1917
January              90.8           138                  65.8
February             95.2           141                  67.5
March               132.8           144                  92.2
April               148.5           148                 100.3
May                 157.6           155                 101.7
June                206.5           160                 129.1
July                159.2           171                  93.1
August              165.6           170                  97.4
September           122.5           166                  73.8
October             154.5           154                 100.3
November             94.3           155                  60.8
December             90.8           155                  58.6

1918
January             161.6           156                 103.6
February            137.3           161                  85.3
March               115.3           162                  71.2
April               128.9           164                  78.6
May                 120.4           167                  72.0
June                248.2           171                 145.1
July                153.0           176                  86.9
August              146.4           180                  81.3
September           124.5           180                  69.2
October             166.1           183                  90.8
November            130.3           183                  71.2
December             57.3           186                  30.8

1919
January              54.1           186                  29.1
February             98.7           186                  53.1
March               121.8           186                  65.5
April               188.8           178                 106.1
May                 234.7           178                 131.9
June                285.4           180                 158.6
July          [illegible]           183                 173.6
August              295.1           194                 152.1
September           228.7           199                 114.9
October             307.5           201                 153.0
November            220.6           207                 106.6
December            226.7           221                 102.6

1920
January             226.1           227                  99.6
February            205.3           240                  85.5
March               302.2           248                 121.9
April               304.9           254                 120.0
May                 264.9           264                 100.3
June                260.1           266                  97.8
July                204.5           265                  77.2
August              202.7           267                  75.9
September           182.2           261                  69.8
October             178.6           256                  69.8
November            132.9           250                  53.2
December            100.1           236                  42.4

1921
January             111.6           231                  48.3
February            102.4           223                  45.9
March               163.8           221                  74.1
April               221.3           212                 104.4
May                 240.9           206                 116.9
June                226.2           201                 112.5
July                212.2           197                 107.7
August              220.4           194                 113.6
September           244.9           188                 130.3
October             222.4           186                 119.6
November            190.8           187                 102.0
December            198.3           186                 106.6

1922
January             166.3           185                  89.9
February            172.7           185                  93.4
March               293.4           187                 156.9
April               353.0           186                 189.8
May                 361.8           188                 192.4
June                342.4           195                 175.6
July                350.1           199                 175.9
August              322.0           197                 163.5
September           271.5           201                 135.1
October             253.1           201                 125.9
November            243.4           197                 123.6
December            214.2           196                 109.3

1923
January             218.7           202                 108.3
February            229.9           209                 110.0
March               338.2           212                 159.5
April               362.9   [illegible]           [illegible]
May                 373.9           227           [illegible]
June                323.6           221                 146.4
July                274.2           219                 125.2
August              253.1           219                 115.6
September     [illegible]   [illegible]                 116.8
October             319.9           216                 148.1
November            289.3           211                 137.1
December            267.9           215                 124.6


The actual and deflated values are plotted in Fig. 60. 

Most value series are affected by price changes, and it is 
generally advisable to correct for this factor before further 
analysis is attempted. Each case presents a new problem, 
for no general deflating index is suitable to all series. The 
index of wholesale prices compiled by the United States 
Bureau of Labor Statistics has been used extensively in 
deflating economic data expressed in dollar values, but this 
index is not at all appropriate in many of the cases in which 
it has been employed. It would be absurd, for instance, to 
deflate money wages by an index of wholesale prices. The 
deflating index employed should be a measure of price 
changes as they affect the series being deflated. 

An interesting example is afforded by the method adopted 
by the statistical department of the Federal Reserve Bank 
of New York in deflating bank clearings outside New York. 
The deflating index employed was a combination of the 
following series, each weighted as indicated: 


Index of rents .......................... 1
Index of wholesale commodity prices ..... 2
Index of wages .......................... [illegible]
Index of cost of living ................. [illegible]




It was found that the index so constructed served quite 
effectively to eliminate the effect of price and wage changes 
upon outside bank clearings. The deflated series has been 
used as a measure of variations in the volume of trade in 
the United States. 
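A composite deflating index of this kind is a weighted average of its component indices. The sketch below is purely illustrative: the weights for rents and wholesale prices follow the list above, but the weights shown for wages and cost of living are hypothetical, since they are not legible in this copy, and all the index values and the clearings figure are invented.

```python
def composite_index(components, weights):
    """Weighted arithmetic average of several index numbers."""
    total_weight = sum(weights.values())
    return sum(components[name] * weights[name] for name in weights) / total_weight

weights = {
    "rents": 1.0,
    "wholesale_prices": 2.0,
    "wages": 3.0,            # hypothetical weight
    "cost_of_living": 3.0,   # hypothetical weight
}
# Hypothetical index values for a single month (base period = 100).
components = {
    "rents": 150.0,
    "wholesale_prices": 160.0,
    "wages": 200.0,
    "cost_of_living": 170.0,
}

deflator = composite_index(components, weights)
clearings = 18.0                                   # hypothetical, billions of dollars
deflated_clearings = clearings / deflator * 100.0
print(round(deflator, 1), round(deflated_clearings, 2))
```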


Fig. 60. Comparison of Actual and Deflated Values of Building Contracts
Awarded, by Months, 1914-1923


The deflation of a value series is in general a first step in 
the study of that series. The way is then open for further 
analysis by methods explained in the present and succeeding 
chapters. 




REFERENCES
Bowley, A. L. Elements of Statistics (132-169).
Davies, G. R. Introduction to Economic Statistics (100-130).
Karsten, Karl. Charts and Graphs (477-481).
Lipka, Joseph. Graphical and Mechanical Computation.
Moore, H. L. Forecasting the Yield and the Price of Cotton
(28-37).
Pearl, Raymond. Medical Biometry and Statistics (332-341).
Persons, W. M. Indices of Business Conditions. Review of
Economic Statistics. Prel. Vol. I, 1919 (1-107).
Running, T. R. Empirical Formulas.
Steinmetz, C. P. Engineering Mathematics (209-255).


(References to books dealing with the method of least squares are given at the 
end of Appendix A.) 


CHAPTER VIII 


THE ANALYSIS OF TIME SERIES: MEASUREMENT 
OF SEASONAL AND CYCLICAL FLUCTUATIONS 


The measurement of secular trend is but one of the 
problems connected with the analysis of a series in time. 
Such series, it has been pointed out, are subject to periodic 
fluctuations, seasonal and cyclical in character, and these 
fluctuations are generally more important in their effects 
upon business than is the long-time trend. Our present 
concern is with methods of isolating such periodic varia- 
tions. The series which follows, characterized by extreme 
seasonal variation, may be used to illustrate methods of 
measuring such seasonal swings. 


TABLE 75

Average Egg Prices Received by Producers in the United States,
1910-1921¹

(Prices are given in cents per dozen; averages are based
upon prices on the first of each month.)

Month        1910  1911  1912  1913  1914  1915  1916  1917  1918  1919  1920  1921
January      30.5  30.4  29.5  26.8  30.7  31.6  30.6  37.7  46.3  57.2  64.8  61.1
February     28.9  22.1  29.1  22.8  28.4  29.2  26.8  35.8  49.4  48.3  56.9  49.6
March        22.9  16.5  24.5  19.4  24.2  21.3  21.2  33.8  40.4  33.1  46.6  29.2
April        18.6  14.9  17.8  16.4  17.6  16.6  17.9  25.9  31.2  34.3  38.8  20.4
May          18.6  14.7  17.1  16.1  16.8  17.1  18.1  30.0  31.0  36.8  37.4  20.2
June         18.3  14.5  16.7  16.9  17.3  16.6  19.0  31.1  29.8  38.6  37.0  19.4
July         18.2  14.2  16.7  17.0  17.6  16.8  19.7  28.3  30.7  36.8  36.7  22.0
August       17.6  15.5  17.4  17.2  18.2  17.0  20.7  29.8  34.4  39.3  40.0  26.6
September    19.4  17.4  19.1  19.5  21.0  18.7  23.3  33.2  36.4  41.0  44.2  30.4
October      22.4  20.0  22.0  23.4  23.5  22.3  28.1  37.4  41.6  44.7  50.1  34.2
November     25.3  23.5  25.9  27.4  25.3  26.3  32.2  39.4  47.2  54.0  56.9  44.2
December     29.0  28.7  29.7  33.0  29.7  30.6  38.1  43.3  55.0  61.9  65.0  51.1

Average      22.5  19.4  22.1  21.8  22.5  22.0  24.6  33.8  39.5  43.8  47.9  34.0


1 Data compiled by U.S. Department of Agriculture.


ARITHMETIC AVERAGES OF MONTHLY ITEMS


A comparatively simple and straightforward method of 
measuring seasonal variations is afforded by the use of 
arithmetic averages of the actual items, by months. Since 
the data cover 12 years, there are 12 January items, 12 
February items, etc. When the items, by months, are 
averaged the results shown in column (2) of Table 76 are
secured.


TABLE 76

The Construction of Indices of Seasonal Variation from Arithmetic
Averages of Actual Prices

   (1)              (2)                     (3)                   (4)
              Arithmetic averages     Arithmetic averages     Indices of
  Month       of monthly egg prices   corrected for trend     Seasonal
              (cents per dozen)       (cents per dozen)       Variation

January             39.77                   40.73               138.8
February            35.61                   36.38               123.9
March               27.76                   28.34                96.5
April               22.53                   22.92                78.1
May                 22.82                   23.01                78.4
June                22.93                   22.93                78.1
July                22.89                   22.70                77.3
August              24.47                   24.08                82.0
September           26.97                   26.39                89.9
October             30.81                   30.04               102.3
November            35.63                   34.66               118.1
December            41.26                   40.10               136.6

Average                                     29.357              100.0


The simple arithmetic averages in column (2) give some
indication of the extent of seasonal variation in the prices 
of eggs. From January there is a sharp falling off to April, 
and then a gradual increase, broken slightly in the month 
of July, reaching a maximum in December. These figures, 
reduced to a percentage basis, would serve as a measure of 
seasonal changes were it not for the effect of secular trend 
upon the averages. From 1910 to 1920 egg prices increased
materially, and this increase, as well as the seasonal variation,
is reflected in the monthly averages. On the average,
taking secular trend alone into account, every February 
exceeds the preceding January, every March exceeds the 
preceding February, etc., and, accordingly, the monthly 
averages must be corrected for this factor if seasonal 
variations alone are to be measured. 

When a straight line is fitted to the twelve yearly averages 
of egg prices from 1910 to 1921, inclusive, it is found that 
the average annual increase in price during this period was 
2.32 cents per dozen. The average monthly increase was 
one-twelfth of this, or .193 cents. That is, taking no account 
of seasonal variations, the trend of egg prices was such that 
every month prices tended to exceed the prices of the 
preceding month by .193 cents. The average of each 
month’s prices, therefore, is affected by this same amount. 
Were the force of secular trend alone in operation, the 
average of the February items would exceed the average of 
the January items by .193 cents, the March average would 
exceed the February average by the same amount, etc.
The effect of this factor may be easily eliminated by taking 
any one month as base, and making the necessary correc- 
tions in the figures for all other months. With January as 
base we would subtract from the February average .193 
cents, from the March average .386 cents, from the April 
average .579 cents, etc. With December as base there 
should be added to the November average .193 cents, to 
the October average .386 cents, etc. It is advisable for 
the present purpose to take a base near the middle of the 
year. Thus with June as base we subtract .193 cents from 
the July average, .386 cents from the August average, etc., 
while we add .193 cents to the May average, .386 cents to 
the April average, etc. When these corrections are made the 
averages given in column (3) of Table 76 are secured. 
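The correction for trend may be sketched as follows. The yearly averages are those of the last row of Table 75 and the monthly averages those of column (2) of Table 76, the January figure, 39.77, being the mean of the twelve January prices. The variable names are illustrative. The slope obtained from the rounded yearly averages comes to about 2.3, slightly below the 2.32 of the text, which was presumably computed from unrounded figures; the text's monthly increment of .193 is therefore used for the correction itself.

```python
import numpy as np

# Yearly averages of egg prices, 1910-1921 (cents per dozen).
yearly_avg = [22.5, 19.4, 22.1, 21.8, 22.5, 22.0,
              24.6, 33.8, 39.5, 43.8, 47.9, 34.0]
t = np.arange(len(yearly_avg))                  # 0 for 1910, 11 for 1921
annual_increase, _ = np.polyfit(t, yearly_avg, 1)

monthly_increase = 0.193                        # figure used in the text (2.32 / 12)

# Arithmetic averages by month, January to December.
monthly_avg = [39.77, 35.61, 27.76, 22.53, 22.82, 22.93,
               22.89, 24.47, 26.97, 30.81, 35.63, 41.26]

BASE = 5                                        # June, counting January as 0
# Add the increment for months before the base, subtract it for months after.
corrected = [avg - (m - BASE) * monthly_increase
             for m, avg in enumerate(monthly_avg)]

print(round(annual_increase, 2))
print([round(c, 2) for c in corrected])
```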

These figures represent the effect of seasonal fluctuations 
alone, during the twelve years included in the study. For 
convenience in use it is advisable to reduce them to a
percentage form, with the average monthly figure (29.357)
as the base. These percentages are given in column (4) 
of Table 76. The use of such indices as these in the 
analysis of time series is demonstrated below. 
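The reduction to percentage form may be sketched thus, taking the corrected averages of column (3) of Table 76 as given. The variable names are illustrative, and a figure computed here from the rounded averages may differ by a unit in the last place from the printed index.

```python
# Corrected arithmetic averages, January to December (cents per dozen).
corrected = [40.73, 36.38, 28.34, 22.92, 23.01, 22.93,
             22.70, 24.08, 26.39, 30.04, 34.66, 40.10]

average = sum(corrected) / len(corrected)        # the base of the percentages
indices = [100.0 * c / average for c in corrected]

print(round(average, 3))
print([round(i, 1) for i in indices])
print(round(sum(indices), 1))                    # twelve indices on an average base total 1200
```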


THE METHOD OF LINK RELATIVES


Another method of measuring seasonal variation has been 
developed by Warren M. Persons. This method has been 
used in several important investigations, notably in the 
construction of indices of business conditions by Persons. 
The first step in this method involves the computation of 
link relatives, by expressing the figure for each month of the
period covered as a percentage of the figure for the preceding
month. Thus January, 1910, is expressed as a percentage
of December, 1909; February, 1910, as a percentage of
January, 1910, etc. The following are the link relatives
thus computed from the data of Table 75. 


TABLE 77
Link Relatives of Monthly Egg Prices

Month        1910   1911   1912   1913   1914   1915   1916   1917   1918   1919   1920   1921
January     107.4  104.8  102.8   90.2   93.0  106.4  100.0   99.0  106.9  104.0  104.7   94.0
February     94.8   72.7   98.6   85.1   92.5   92.4   87.6   95.0  106.7   84.4   87.8   81.2
March        79.2   74.7   84.2   85.1   85.2   72.9   79.1   94.4   81.8   68.5   81.9   58.9
April        81.2   90.3   72.7   84.5   72.7   77.9   84.4   76.6   77.2  103.6   83.3   69.9
May         100.0   98.7   96.1   98.2   95.5  103.0  101.1  115.8   99.4  107.3   96.4   99.0
June         98.4   98.6   97.7  105.0  103.0   97.1  105.0  103.7   96.1  104.9   98.9   96.0
July         99.5   97.9  100.0  101.0  101.7  101.2  103.7   91.0  103.0   95.3   99.2  113.4
August       96.7  109.2  104.2  101.2  103.4  101.2  105.1  105.3  112.1  106.8  109.0  120.9
September   110.2  112.3  109.8  113.4  115.4  110.0  112.6  111.4  105.8  104.3  110.5  114.3
October     115.5  114.9  115.2  120.0  111.9  119.3  120.6  112.6  114.3  109.0  113.3  112.5
November    112.9  117.5  117.7  117.1  107.7  117.9  114.6  105.3  113.5  120.8  113.6  129.2
December    114.6  122.1  114.7  120.4  117.4  116.3  118.3  109.9  116.5  114.6  114.2  115.6
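The computation of link relatives may be sketched as follows, using the prices for the first six months of 1910 from Table 75. The function name is illustrative.

```python
def link_relatives(series):
    """Each item as a percentage of the item immediately preceding it."""
    return [round(100.0 * cur / prev, 1)
            for prev, cur in zip(series, series[1:])]

# Egg prices, January to June, 1910 (cents per dozen).
prices_1910 = [30.5, 28.9, 22.9, 18.6, 18.6, 18.3]

# Relatives for February to June; January requires the December, 1909, price.
print(link_relatives(prices_1910))
```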


It remains to determine the average relationship of 
January prices to December prices, of February prices to 
January prices, etc. In the preceding example this was done
by getting arithmetic averages of each month’s prices and 
reducing these to percentages. In employing Persons’
method the median link relatives are found. That is, all
the January link relatives are grouped in order of magnitude, 
and the median value determined, the same being done for 
each of the other months. The median link relatives thus 
secured are given in column (2) of Table 78. 


TABLE 78

The Construction of Indices of Seasonal Variation by the Method of
Link Relatives

   (1)            (2)              (3)           (4)            (5)
              Median Link                      Corrected     Indices of
  Month       Relatives of       Chain         Chain         Seasonal
              Monthly Prices     Relatives     Relatives     Variation

January          103.4           100.0         100.0          139.5
February          90.1            90.1          89.7          125.1
March             80.5            72.5          71.9          100.3
April             79.5            57.4          56.7           79.1
May               99.2            56.9          56.0           78.1
June              98.7            56.2          55.1           76.9
July             100.5            56.5          55.2           77.0
August           105.2            59.4          57.8           80.6
September        111.0            65.9          63.9           89.1
October          114.6            75.5          72.9          101.7
November         115.8            87.4          84.0          117.2
December         116.0           101.4          97.1          135.4

Average                                         71.69         100.0
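The selection of the median link relative may be sketched for two of the months, using the January and February rows of Table 77 and the `median` function of Python's standard `statistics` module. With twelve items the median is the mean of the sixth and seventh in order of magnitude.

```python
from statistics import median

# Link relatives for January and February, 1910-1921.
january = [107.4, 104.8, 102.8, 90.2, 93.0, 106.4,
           100.0, 99.0, 106.9, 104.0, 104.7, 94.0]
february = [94.8, 72.7, 98.6, 85.1, 92.5, 92.4,
            87.6, 95.0, 106.7, 84.4, 87.8, 81.2]

median_january = round(median(january), 1)
median_february = round(median(february), 1)
print(median_january, median_february)
```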


These median link relatives measure the monthly varia- 
tions in terms of a shifting base, each month being the base 
for the one which follows. It is next necessary to express 
the monthly variations as percentages of a constant base. 
To this end chain relatives are constructed, with January 
as base. The January figure is thus 100; the February 
figure is 90.1, which corresponds to the median link relative, 
since all the February link relatives are computed on 
January as base. But the March median link relative is 
80.5 with February as a base; the March chain relative, 
with January as base, will be 80.5 × .901, or 72.5. Similarly
the chain relative for each month is computed by multiplying
the median link relative of that month by the chain
relative of the month preceding. The figures shown in 
column (3) of Table 78 are secured. 
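The chaining process may be sketched as follows, taking the median link relatives of column (2) of Table 78 as given. The function name is illustrative. Because unrounded products are carried forward here, the figures for the later months differ somewhat from those printed in column (3), and the example should be read as an illustration of the procedure and not as a reproduction of the table.

```python
def chain_relatives(median_links):
    """Chain relatives on January = 100; median_links[0] is January on December."""
    chain = [100.0]
    for link in median_links[1:]:
        chain.append(chain[-1] * link / 100.0)   # preceding chain relative times link
    return chain

# Median link relatives, January to December.
median_links = [103.4, 90.1, 80.5, 79.5, 99.2, 98.7,
                100.5, 105.2, 111.0, 114.6, 115.8, 116.0]

chain = chain_relatives(median_links)
# Carrying the chain round to January again; this would be 100 were there no trend.
closing = chain[-1] * median_links[0] / 100.0

print([round(c, 1) for c in chain[:3]])
```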

If there were no upward or downward trend in the data 
the product secured by multiplying the December chain 
relative by the January median link relative would be 100. 
Actually the product of 1.014 and 103.4 (the latter being 
the January median link relative) is 104.8. The existence 
of a constant trend, affecting each of the twelve median 
link relatives, is responsible for the difference. This, it 
should be noted, is the result of a cumulative error, com- 
pounded twelve times when the median link relatives are 
multiplied by the successive chain relatives. The case is 
identical with the compounding of a sum of money at a 
fixed rate of interest. The error in the present case (the 
rate of monthly increase due to secular trend) corresponds 
to the rate of interest. The formula for the growth of a
fixed sum ($P_0$) compounded at a constant rate of interest
over $n$ years is as follows ($P_n$ representing the principal at
the end of $n$ years):

$$P_n = P_0(1 + r)^n.$$

The rate of growth is thus represented by

$$r = \sqrt[n]{\frac{P_n}{P_0}} - 1.$$

The values secured in the present case may be inserted
in these formulas:

$$104.8 = 100(1 + r)^{12}$$

$$r = \sqrt[12]{\frac{104.8}{100}} - 1 = 1.004 - 1 = .004.$$


The increase due to secular trend is at the rate of .004
per month. To eliminate the effect of this factor the
February chain relative must be divided by 1.004, that is,
by $(1 + r)$, the March chain relative by 1.008, $(1 + r)^2$, etc.,
the December chain relative being divided by 1.044, or
$(1 + r)^{11}$. In this way the effect of long-term trend is
eliminated from the chain relatives which are to be used as
measures of seasonal variation. These corrected chain
relatives are given in column (4) of Table 78.

It is desirable that an index of seasonal variation be based 
upon an average monthly figure, rather than upon the 
January figure. With the average as base the total of the 
twelve monthly percentages will be 1200, and the interpre- 
tation and use of each monthly figure is somewhat simpler. 
When the monthly figures in the preceding table are thus 
expressed as percentages of their average (71.69) the 
indices in column (5) of Table 78 are secured. 
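
The four steps of the link-relative method (chaining, measuring the trend error, removing it, and re-basing on the average) may be sketched in Python as follows. This is only an illustrative sketch: the function names are assumptions introduced here, and the inputs are the rounded medians of column (2), so the later figures may differ somewhat from the printed columns.

```python
# Median link relatives, column (2) of Table 78 (January .. December).
MEDIAN_LINKS = [103.4, 90.1, 80.5, 79.5, 99.2, 98.7,
                100.5, 105.2, 111.0, 114.6, 115.8, 116.0]


def chain_relatives(links):
    """Chain the link relatives on a fixed January base of 100."""
    chain = [100.0]
    for link in links[1:]:
        chain.append(chain[-1] * link / 100.0)
    return chain


def monthly_trend_rate(links, chain):
    """Solve closing = 100 * (1 + r)**12 for r, where closing is the
    December chain relative carried forward by the January link relative."""
    closing = chain[-1] * links[0] / 100.0
    return (closing / 100.0) ** (1.0 / 12.0) - 1.0


def remove_trend(chain, r):
    """Divide the k-th chain relative (January = 0) by (1 + r)**k."""
    return [value / (1.0 + r) ** k for k, value in enumerate(chain)]


def as_percent_of_average(values):
    """Re-base the figures so that their average is 100 (total 1200)."""
    average = sum(values) / len(values)
    return [100.0 * value / average for value in values]


chain = chain_relatives(MEDIAN_LINKS)
rate = monthly_trend_rate(MEDIAN_LINKS, chain)
corrected = remove_trend(chain, rate)
indices = as_percent_of_average(corrected)
```

The exponent `k` runs from 0 for January to 11 for December, which matches the divisors $(1 + r)$ through $(1 + r)^{11}$ described in the text.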


THE USE OF MOVING AVERAGES


Another method of measuring seasonal variations is 
offered by the device of moving averages. Since these 
fluctuations take place within a constant period of twelve 
months, a moving average may be used with more confidence 
than when a cycle of varying length is involved. The 
magnitude of the fluctuations (the amplitude of the seasonal 
swings) will not ordinarily be constant, hence the line 
marked out by the moving averages will not be completely 
free of seasonal influences. The relation of the actual 
monthly items to the moving averages may be averaged, 
however, and the indices of seasonal variation based upon 
these averages. 

It is essential, of course, that the moving average, cen- 
tered, fall at the same date as the original figure with which 
it is to be compared. This involves a second process of 
averaging. For example, the egg prices given are as of the 
first of each month. The average of the twelve monthly 
items for 1910, when centered, falls on June 15th. The 
average of the items from February, 1910, through January,
1911, centered, falls on July 15th. To secure a figure
comparable with the July 1st price, these two must be averaged.
By this process of computing a two-month moving average 
from the twelve-month moving average comparability with 
the original figures is secured. The following table presents 
the averages obtained in this way: 


TABLE 79
Moving Averages of Egg Prices, 1910-1921

(12-month moving average, centered, adjusted by 2-month moving average, centered)
(Prices are given in cents per dozen)

Month        1910   1911   1912   1913   1914   1915   1916   1917   1918   1919   1920   1921
January      .....  20.25  21.26  20.78  22.74  22.29  22.24  30.06  36.72  41.39  46.60  40.47
February     .....  19.99  21.45  20.79  22.80  22.20  22.51  30.80  37.01  41.86  46.63  39.30
March        .....  19.82  21.60  20.79  22.91  22.06  22.86  31.59  37.33  42.25  46.80  38.16
April        .....  19.64  21.75  20.87  22.97  21.91  23.29  32.39  37.64  42.57  47.15  36.93
May          .....  19.46  21.94  20.99  22.89  21.90  23.77  33.08  38.14  42.98  47.50  35.74
June         .....  19.38  22.08  21.19  22.66  21.98  24.33  33.59  38.96  43.56  47.75  34.63
July         22.47  19.33  22.00  21.49  22.57  21.98  24.94  34.17  39.90  44.17  47.72  .....
August       22.18  19.58  21.63  21.88  22.64  21.84  25.60  35.10  40.31  44.84  47.26  .....
September    21.63  20.20  21.16  22.32  22.55  21.74  26.51  35.93  39.96  45.76  46.24  .....
October      21.21  20.66  20.89  22.57  22.39  21.78  27.36  36.43  39.79  46.50  44.74  .....
November     20.89  20.88  20.79  22.65  22.36  21.88  28.20  36.69  40.17  46.71  43.26  .....
December     20.57  21.07  20.76  22.69  22.35  22.02  29.20  36.68  40.77  46.67  41.81  .....
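
The double averaging described above (a twelve-month moving average followed by a two-month moving average of the result) can be sketched as follows. The function name and the short test series are assumptions made for illustration; the egg prices themselves are not reproduced here.

```python
def centered_moving_average(values, period=12):
    """Return a list aligned with `values`; entries that cannot be
    computed (the first and last half-period) are left as None."""
    half = period // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        # Average of the twelve items ending one month earlier ...
        first = sum(values[i - half:i + half]) / period
        # ... and of the twelve items shifted forward by one month.
        second = sum(values[i - half + 1:i + half + 1]) / period
        # Their mean falls on the same date as values[i].
        out[i] = (first + second) / 2.0
    return out


# A steadily rising series: the centered average should reproduce it.
series = [float(month) for month in range(24)]
smoothed = centered_moving_average(series)
```

With monthly data beginning in January, the first available value falls on the seventh item (July) and the last on the sixth item from the end (June), which is the pattern of blanks seen in Table 79.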


The actual prices are now expressed as percentages of the 
corresponding moving averages. These percentages are 
given below: 

TABLE 80

Percentage Relation of Actual Egg Prices to 12-Month Moving Averages

Month        1910   1911   1912   1913   1914   1915   1916   1917   1918   1919   1920   1921
January      .....  150.1  138.8  129.0  135.0  141.8  137.6  125.4  126.1  138.2  139.1  151.0
February     .....  110.6  135.7  109.7  124.6  131.5  119.1  116.2  133.5  115.4  122.0  126.2
March        .....   83.2  113.4   93.3  105.6   96.6   92.7  107.0  108.2   78.3   99.6   76.5
April        .....   75.9   81.8   78.6   76.6   75.8   76.9   80.0   82.9   80.6   82.3   55.2
May          .....   75.5   77.9   76.7   73.4   78.1   76.1   90.7   81.3   85.6   78.7   56.5
June         .....   74.8   75.6   79.8   76.3   75.5   78.1   92.6   76.5   88.6   77.5   56.0
July          81.0   73.5   75.9   79.1   78.0   76.4   79.0   82.8   76.9   83.3   76.9  .....
August        79.4   79.2   80.4   78.6   80.4   77.8   80.9   84.9   85.3   87.6   84.6  .....
September     89.7   86.1   90.3   87.4   93.1   86.0   87.9   92.4   91.1   89.6   95.6  .....
October      105.6   96.8  105.3  103.7  105.0  102.4  102.7  102.7  104.5   96.1  112.0  .....
November     121.1  112.5  124.6  121.0  113.1  120.2  114.2  107.4  117.6  115.6  131.5  .....
December     141.0  136.2  143.1  145.4  132.9  139.0  130.5   [?]    [?]    [?]    [?]   .....


These percentages show considerable variation from year
to year in the relation of the figures for a given month to
the moving average. Thus the January figures, while always
above the average, vary from 125.4 per cent to 151 per cent
of the average. The eleven percentages secured for each
month must be averaged to obtain the index desired.
Either the arithmetic average or the median may be
employed for this purpose. The results secured by applying
the two methods are shown in the following table. In
columns (2) and (3) the actual arithmetic means and
medians are given. These indices may be rendered more
serviceable by adjustments which make the average in
each case equal to 100. These adjustments have been
made in computing the index numbers appearing in columns
(4) and (5).


TABLE 81
Indices of Seasonal Variation in Egg Prices Computed from Moving
Averages

   (1)           (2)            (3)            (4)           (5)
              Arithmetic                    Arithmetic
  Month         Means         Medians         Means        Medians
             (Unadjusted)   (Unadjusted)    (Adjusted)    (Adjusted)

January         137.4          138.2          137.9         138.7
February        122.2          122.0          122.6         122.5
March            95.9           96.6           96.2          97.0
April            77.0           78.6           77.3          78.9
May              77.3           77.9           77.6          78.2
June             77.4           76.5           77.7          76.8
July             78.4           78.0           78.7          78.3
August           81.7           80.4           82.0          80.7
September        89.9           89.7           90.2          90.1
October         103.3          103.7          103.7         104.1
November        118.1          117.5          118.5         118.0
December        137.2          136.2          137.7         136.7
Average          99.65          99.608        100.0         100.0
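
The averaging and adjustment behind Table 81 can be illustrated with a short sketch. The April percentages are those of Table 80 and the unadjusted medians are those of column (3); the function and variable names are assumptions made for this example.

```python
from statistics import mean, median

# April percentages of actual price to moving average, 1911-1921 (Table 80).
APRIL_RATIOS = [75.9, 81.8, 78.6, 76.6, 75.8, 76.9, 80.0, 82.9, 80.6, 82.3, 55.2]

# Unadjusted medians for January .. December, column (3) of Table 81.
UNADJUSTED_MEDIANS = [138.2, 122.0, 96.6, 78.6, 77.9, 76.5,
                      78.0, 80.4, 89.7, 103.7, 117.5, 136.2]


def adjust_to_average_100(indices):
    """Scale the twelve monthly figures so that their average is 100."""
    average = mean(indices)
    return [100.0 * value / average for value in indices]


april_mean = mean(APRIL_RATIOS)       # pulled down by the 1921 figure of 55.2
april_median = median(APRIL_RATIOS)   # unaffected by that single extreme item
adjusted_medians = adjust_to_average_100(UNADJUSTED_MEDIANS)
```

The April figures show why the two averages can differ: one exceptional year lowers the mean by more than a point, while the median is untouched.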


THE COMPUTATION OF INDEX NUMBERS OF SEASONAL
VARIATION BY AVERAGING RATIOS TO TREND


A further method of securing seasonal indices, which has
certain distinctive advantages, has recently been developed.¹
In the application of this method, a suitable line of trend,
linear or non-linear, is fitted to the data, the actual monthly
items are expressed as percentages of the corresponding
trend figures, and then, for each month, an average of the
percentage ratios of the actual to the trend values is secured.
This procedure is identical with that described in connection
with the use of moving averages, except that the actual
values may be expressed as percentages of normal values
derived from any function employed to represent trend.
In the selection of an average value for each month, use
may be made of a multiple frequency table in obtaining
an understanding of the nature of the actual seasonal
movement. With the help of such a table the existence of
a definite seasonal movement may be verified and the
type of average to be used in securing a typical value for
each month may be determined. We shall apply this
method to the data employed in the preceding examples.

1 The essentials of this method were worked out independently by Helen D.
Falkner ("The Measurement of Seasonal Variation," Journal of the American
Statistical Association, June, 1924, 167-179) and Lincoln W. Hall ("Seasonal
Variation as a Relative of Secular Trend," Journal of the American Statistical
Association, June, 1924, 156-166).

A straight line, fitted to annual average egg prices from
1910 to 1921, as given in Table 75, is described by the
equation

$$y = 29.45 + 2.32x$$

with origin at Jan. 1, 1916. Normal values for each month
may be computed readily, by methods previously described.
Expressing the actual values as percentages of the trend
values we secure for each month twelve such percentage
ratios (since the data cover twelve years). The twelve
January percentages vary from 103.5 to 194.8, the twelve
May percentages from 48.2 to 113.7, etc. The multiple
frequency table which appears in Fig. 61 is constructed
by classifying, in the form of a frequency distribution, the
items for each month.

The presence of a distinct seasonal variation is demonstrated
by this table. For the winter months the percentages
are consistently higher than for the spring and
summer months, and we may conclude that there is a
pronounced seasonal variation capable of representation
by index numbers.
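
The expression of a monthly price as a percentage of the trend ordinate may be sketched as below. The sketch rests on assumptions that the text does not state outright: that the x-unit of the trend line is one year, that the slope is 2.32 per year, and that each price is dated the first of its month. The function names are hypothetical.

```python
def trend_value(year, month, intercept=29.45, slope=2.32, origin=(1916, 1)):
    """Ordinate of the straight-line trend on the first of the given month.

    Assumes x is measured in years from the origin (Jan. 1, 1916), so one
    month is one twelfth of an x-unit.
    """
    months_from_origin = (year - origin[0]) * 12 + (month - origin[1])
    return intercept + slope * months_from_origin / 12.0


def ratio_to_trend(actual_price, year, month):
    """Actual price as a percentage of the corresponding trend value."""
    return 100.0 * actual_price / trend_value(year, month)
```

Collecting `ratio_to_trend` for every January, every February, and so on gives the twelve sets of percentages that are classified in the multiple frequency table.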

Fig. 61. Frequency Distributions of Ratios of Monthly Egg Prices to the
Corresponding Ordinates of Secular Trend. (Based on data for the period
1910-1921)

This table facilitates the selection of a type of average for
the measurement of the seasonal movements. The median
is likely to be unrepresentative, and subject to material
change in value by the addition or withdrawal of one or
two entries, unless there is a definite concentration in the
monthly frequency distributions. The arithmetic mean
of all the items, on the other hand, will be unduly affected
by exceptional cases. An alternative method is provided
by the possibility of taking the arithmetic mean of the 
central items for each month. If an inspection of the 
multiple frequency table does not lead to an immediate 
decision as to which is the best type of average to employ 
in a given case, several index numbers may be worked 
out for each month, and a decision reached after a com- 
parison of the results. From the percentage ratios of actual 
to trend values, as computed from the data of Table 75, 
six different averages have been worked out for each month. 
The simple averages constitute the unadjusted index num- 
bers given below. Correcting these so that the average 
of each group is equal to 100, we secure the adjusted index 
numbers presented in the same table. (These averages 
have been derived, not from the frequency distributions 
shown in Fig. 61, but from the individual percentages of 
actual to trend values.) 


TABLE 82

Unadjusted and Adjusted Index Numbers of Seasonal Variation in
Egg Prices

(Based upon Percentage Ratios of Actual Values to Linear Trend Values.)

                   Unadjusted Index Numbers                    Adjusted Index Numbers
                 (based upon given number of                (based upon given number of
                       central items)                             central items)
Month           2      4      6      8     10     12       2      4      6      8     10     12

January      140.5  138.2  138.0  139.2  139.8  141.4   142.1  139.4  138.4  139.1  139.6  140.5
February     120.8  121.6  123.4  123.8  123.6  125.7   122.2  122.6  123.8  123.7  123.5  124.9
March         92.8   95.0   96.9   97.1   96.7   98.3    93.9   95.8   97.2   97.1   96.6   97.7
April         80.0   79.0   79.1   78.1   78.3   78.9    80.9   79.7   79.3   78.1   78.2   78.4
May           79.8   79.3   79.1   78.8   78.8   79.2    80.7   80.0   79.8   78.8   78.7   78.7
June          77.6   77.9   78.6   78.6   79.0   78.9    78.5   78.6   78.8   78.6   78.9   78.4
July          76.1   77.3   77.3   77.5   77.7   78.1    77.0   77.9   77.5   77.5   77.6   77.6
August        80.3   80.5   81.4   81.8   82.1   82.0    81.2   81.2   81.6   81.8   82.0   81.5
September     88.3   89.2   89.7   90.1   90.2   90.0    89.3   89.9   90.0   90.1   90.1   89.5
October      100.7  102.2  102.4  102.4  102.3  102.4   101.8  103.1  102.7  102.4  102.2  101.8
November     116.6  115.7  116.0  117.5  117.3  117.3   117.9  116.7  116.3  117.5  117.2  116.6
December     133.0  134.0  134.7  135.6  135.5  135.2   134.5  135.1  135.1  135.6  135.4  134.4


The unadjusted index numbers in the first column of
figures are the arithmetic averages of the two middle items
in each monthly frequency distribution; the figures in the
next column are the arithmetic averages of the four central
items, etc.

In the present instance the seasonal variation is so 
pronounced and regular, year after year, that no striking 
differences appear in the results secured by averaging 
varying numbers of items. The index numbers based upon 
averages of the four central items for each of the twelve 
months are perhaps the best to employ in this case. In 
general, an average of three, four or five central values is 
more likely to be stable and representative than either the 
median or an average of all the items for each month. The 
greater the concentration in the monthly frequency tables, 
the smaller the number of items upon which the index 
numbers may be based. 
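
An average of the central items may be computed by ranking the percentages for a month and discarding an equal number from each end. The sketch below is illustrative only: the function name is an assumption, and the sample figures are the eleven April ratios to the moving average from Table 80 rather than the ratios to trend used for Table 82.

```python
def mean_of_central(items, k):
    """Arithmetic mean of the k central items after ranking.

    k must leave the same number of items to drop at each end, so it
    must have the same parity as len(items).
    """
    n = len(items)
    if k < 1 or k > n or (n - k) % 2 != 0:
        raise ValueError("k must be between 1 and n and match the parity of n")
    ordered = sorted(items)
    drop = (n - k) // 2
    central = ordered[drop:n - drop]
    return sum(central) / k


april = [75.9, 81.8, 78.6, 76.6, 75.8, 76.9, 80.0, 82.9, 80.6, 82.3, 55.2]
median_value = mean_of_central(april, 1)    # the median itself
central_three = mean_of_central(april, 3)   # mean of 76.9, 78.6 and 80.0
all_items = mean_of_central(april, 11)      # ordinary arithmetic mean
```

Taking `k` equal to 1 (or 2 for an even number of items) gives the median, and taking `k` equal to the number of items gives the ordinary mean, so the intermediate choices lie between the two extremes discussed in the text.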

In the following table four sets of index numbers of 
seasonal variation in egg prices, secured by the different 
methods illustrated above, have been brought together 
for comparison. 


TABLE 83
Indices of Seasonal Variation in Egg Prices
(Adjusted)

                                  Computed from                     Computed from
               Computed from       Arithmetic                        Percentage
  Month           Moving            Means of      Computed from       Ratios to
                 Average            Monthly      Link Relatives         Trend
                 Medians             Prices                        (Average of 4)

January           138.7              138.8            139.5             139.4
February          122.5              123.9            125.1             122.6
March              97.0               96.5            100.3              95.8
April              78.9               78.1             79.1              79.7
May                78.2               78.4             78.1              80.0
June               76.8               78.1             76.9              78.6
July               78.3               77.3             77.0              77.9
August             80.7               82.0             80.6              81.2
September          90.1               89.9             89.1              89.9
October           104.1              102.3            101.7             103.1
November          118.0              118.1            117.2             116.7
December          136.7              136.6            135.4             135.1

Average           100.0              100.0            100.0             100.0




COMPARISON OF INDICES OF SEASONAL VARIATION


These four indices show a fairly close agreement as to 
the seasonal variations in egg prices. Discrepancies of 
some importance occur in the months of February and 
March, the difference between the two extreme figures 
amounting to more than four per cent in the latter month. 
There is no absolute criterion by means of which the best 
method of measuring seasonal variation may be determined 
but certain general conclusions may be suggested. The 
arithmetic mean of actual prices, while simple in calcula- 
tion, should not be applied if extreme variations are 
present, or if the data change materially in value during 
the period covered. The use of link relatives involves a 
considerable amount of routine calculation, and in the 
analysis of homogeneous and fairly regular series this 
method does not appear to be superior to the simpler 
methods.¹ The averaging of percentage ratios to moving
averages or to normal values determined by fitting appro- 
priate lines of trend is probably the most generally appli- 
cable and most useful method. The arithmetic operations 
are simple, particularly when normal values are based 
upon mathematical lines of trend, and there is sound 
theoretical justification for the averaging of ratios rather 
than absolute values. 

1 W. M. Persons points out the following advantages of the method of
link relatives. (The first two of these, it may be noted, are shared by the
methods which employ ratios to moving averages and to mathematical lines
of trend.)

a. The frequency distributions of link relatives enable one to judge the 
degree of regularity of month-to-month or seasonal changes. 

b. The use of the median (or average of central items) is a device by which 
the influence of large non-seasonal variations may be greatly moderated. 

c. It is possible to use non-homogeneous statistical series in the measurement 
of seasonal variation. For instance, link relatives for bank clearings of 50 repre- 
sentative cities for one interval may be combined with link relatives for the 
clearings of 100 representative cities for another interval. 

— Journal of the American Statistical Association, June, 1923, 717. 




THE DETERMINATION OF MONTHLY NORMAL VALUES


Before proceeding to a problem which involves the 
application of corrections for both trend and seasonal varia- 
tions, the steps to be taken in making the transition from 
an annual to a monthly basis, in the computation of normal 
or trend values, should be outlined. Only simple arithmetic 
calculations are involved. 

The following figures, arbitrarily selected, may be used to
exemplify the procedure.

            Normal Annual Production,
  Year            Commodity X
  1910                186
  1911                330
  1912                474
  1913                618


The data are plotted in Fig. 62, A. 

These figures represent the normal production during 
four successive years. The line of trend in this case is 
described by the equation 


$$y = 186 + 144x$$


with origin at 1910. Since each annual figure is a total, it 
is to be considered as centered at the middle of the period 
covered (i.e., July 1st) and the line is drawn through the
points so located. This line has a slope of 144, the annual 
increase in production during this period. 

We may consider the reduction to a monthly basis to be 
made in two steps. First, we secure an equation in which 
the annual production is expressed in terms of monthly 
averages. This is done by dividing the values 186 and 144, 
as given in the preceding equation, by 12, securing 

$$y = 15.5 + 12x$$


with origin at the middle of 1910. This equation is plotted
in Fig. 62, B. Each of the narrow columns in this diagram
represents the production during a month centering at the
middle of the year in question (i.e., July 1st). The slope of
the line in this figure is 12, which represents the increase
in production during a given month in one year as compared
with the same month in the year preceding. Thus normal
production for the month centering at July 1, 1910, is 15.5;
normal production for the month centering at July 1, 1911,
is 27.5, etc. The x-unit in this equation, it should be noted,
is one year.

Fig. 62. Showing the Method of ... Panels: (A) Total Annual Production;
(B) Monthly Production; (C) Computed Production by Months.
(The x-unit is one year in (A) and (B), one month in (C))

The next step is the computation of the monthly normal 
or trend figures. If the increase in normal monthly pro- 
duction between July 1, 1910, and July 1, 1911, is 12, the in- 
crease from July 1, 1910, to August 1, 1910, is 1. Thus, for 
convenience, we may change the abscissal unit from one 
year to one month. With this unit, the equation would be 

$$y = 15.5 + x$$


with origin at July 1, 1910. This is not a convenient origin, 
since actual monthly data would be considered as centered 
at the middle of each month. If the ordinate at July 1, 1910, 
is 15.5, and the monthly increment is 1, the ordinate at 
January 15, 1910, would be 10. Shifting the origin to 
January 15, 1910, accordingly, we have the equation 


$$y = 10 + x$$


From this equation the normal production for any month 
may be readily determined. Figure 62, C, illustrates the 
relation between normal production in successive months 
during the period 1910-1913. 
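
The passage from the annual equation to the monthly one may be followed step by step in a short sketch. The function name is an assumption; the figures are those of the Commodity X example, in which the annual values are totals centered on July 1.

```python
def monthly_trend_from_annual_totals(intercept, slope):
    """Convert y = intercept + slope * x (annual totals, x in years,
    centered on July 1 of the first year) into a monthly trend line with
    x in months and origin at January 15 of the first year."""
    # Step 1: express the annual figures as monthly averages.
    monthly_average_at_july1 = intercept / 12.0       # 15.5 in the example
    yearly_change_in_monthly_average = slope / 12.0   # 12 in the example
    # Step 2: change the x-unit from one year to one month.
    increment_per_month = yearly_change_in_monthly_average / 12.0  # 1
    # Shift the origin back five and a half months, July 1 to January 15.
    january15 = monthly_average_at_july1 - 5.5 * increment_per_month
    return january15, increment_per_month


a, b = monthly_trend_from_annual_totals(186.0, 144.0)
normals_1910 = [a + b * month for month in range(12)]   # January .. December
```

As a check on the half-month correction, the twelve monthly normals for 1910 (10, 11, ..., 21) add up to 186, the annual normal from which they were derived.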

In Fig. 63 the relations between total annual production 
(normal), average monthly production (normal) and monthly 
normals for the year 1910 are brought out in greater detail.¹
The particular point which is emphasized in this chart is the 
necessity for a correction of one half month in deriving 
the monthly normals from the average monthly figures. 

In series such as that illustrated above the monthly 
increment due to trend is equal to the annual increment 
divided by 144. This is true in all cases in which the annual 
figure is the sum of the twelve monthly figures, as with all 
data relating to production. In price series or series of any 
other type in which the annual figure is the average of the 
twelve monthly figures, the monthly increment is equal to
the annual increment divided by 12. The first of the two
steps illustrated above is unnecessary in such cases.

1 Figs. 62 and 63 were devised and prepared by D. H. Davenport.
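
The distinction between the two kinds of series reduces to a choice of divisor, as the following sketch shows. The function name and its flag are hypothetical, introduced only to make the rule explicit.

```python
def monthly_increment(annual_increment, annual_is_total):
    """Monthly increment of trend implied by an annual increment.

    annual_is_total=True  : annual figure is the sum of twelve monthly
                            figures (production series), divide by 144.
    annual_is_total=False : annual figure is the average of twelve monthly
                            figures (price series), divide by 12.
    """
    divisor = 144.0 if annual_is_total else 12.0
    return annual_increment / divisor
```

For the Commodity X example, `monthly_increment(144.0, True)` gives the monthly increment of 1 found above.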

Fig. 63. Showing the Relation between Trend Values of Total Annual
Production, Average Monthly Production and Production by Months.
(Panels: Total Production in 1910; Average Monthly Production in 1910;
Monthly Production in 1910; trend values in each case.)


THE MEASUREMENT OF CYCLICAL FLUCTUATIONS


There remains the task of combining the corrections for 
secular trend and seasonal variation in order to secure
measures of cyclical changes in a given series. Major
interest in most economic studies attaches to these cyclical
changes, and the measurement of such changes is usually 
the central problem in the analysis of time series. To illus- 
trate the procedure we shall analyze the figures for bitumin- 
ous coal production in the United States, from 1901 to 1923. 

An appropriate line of trend must first be fitted to the 
annual production figures for this period. An equation of 
the type 

$$\log y = a + bx + cx^2$$

is selected, and the values of the constants are determined by 
the method of least squares. The equation to the curve 
which describes the trend is 


$$\log y = 2.646052 + .014226x - .0010246x^2$$


From this equation the normal production for the several 
years may be computed. These are given in the follow- 
ing table, in comparison with the actual figures. 


TABLE 84

Bituminous Coal Production in the United States
Comparison of Actual and Normal Values, 1901-1923

 (1)          (2)               (3)               (4)                (5)
        Actual Production    Logarithm     Normal Production   Actual Production
 Year    (in millions of     of Normal      (in millions of      as Percentage
            net tons)        Production        net tons)           of Normal

 1901        225.83          2.365589           232.05               97.3
 1902        260.22          2.401332           251.96              103.2
 1903        282.75          2.435025           272.29              103.8
 1904        278.66          2.466670           292.87               95.1
 1905        315.06          2.496265           313.52              100.5
 1906        342.87          2.523810           334.05              102.6
 1907        394.76          2.549307           354.25              111.4
 1908        302.01          2.572754           373.90               88.9
 1909        379.74          2.594153           392.78               96.7
 1910        417.11          2.613502           410.68              101.6
 1911        405.91          2.630801           427.36               95.0
 1912        450.10          2.646052           442.64              101.7
 1913        478.44          2.659253           456.30              104.9
 1914        422.70          2.670406           468.17               90.3
 1915        442.62          2.679509           478.09               92.6
 1916        502.52          2.686562           485.92              103.4
 1917        551.79          2.691567           491.55              112.3
 1918        579.38          2.694522           494.91              117.1
 1919        465.86          2.695429           495.94               93.9
 1920        568.67          2.694286           494.64              115.0
 1921        415.92          2.691093           491.01               84.8
 1922        404.51          2.685852           485.12               83.4
 1923        545.30          2.678561           477.05              114.3
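
The normal values of Table 84 follow directly from the fitted equation. In the sketch below the origin is taken at 1912, an assumption inferred from the table (the logarithm for that year equals the constant term 2.646052); the function names are introduced here for illustration.

```python
def normal_production(year, origin=1912):
    """Trend ordinate, in millions of net tons, from the fitted curve
    log y = 2.646052 + .014226 x - .0010246 x^2 (common logarithms)."""
    x = year - origin
    log_y = 2.646052 + 0.014226 * x - 0.0010246 * x * x
    return 10.0 ** log_y


def percent_of_normal(actual, year):
    """Actual annual production as a percentage of the trend ordinate."""
    return 100.0 * actual / normal_production(year)


normal_1912 = normal_production(1912)
normal_1901 = normal_production(1901)
ratio_1920 = percent_of_normal(568.67, 1920)
```

Because the coefficient of the squared term is negative, the curve rises at a diminishing rate and reaches its highest point near 1919, in agreement with the table.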


From this table the cycles in bituminous coal production, 
as shown by annual production figures, may be determined 
directly. Our present problem, however, is the measurement 
of cyclical changes in coal production, as reflected in monthly 
figures. 

The first step is the determination of the ordinate of the 
line of trend for each month of the period covered. This 
ordinate measures the ‘“‘normal”’ production for that month. 
The line of trend has been fitted to annual data, and it 
is necessary to effect the transition to a monthly basis. 

Each of the figures entering into this series is to be 
considered as representing the middle of the period covered. 
That is, the total production for the year 1913 is to be con- 
sidered as centered at July 1, 1913; the figure for production 
in January, 1913, is to be taken as of January 15, 1913. 

The trend value for 1913 is 456.30 millions of tons. This 
is taken as normal for the year centering at July 1, 1913. 
Dividing this by 12, we have 38.025 millions of tons as the 
normal production for the month centering at July 1, 1913. 
As this figure stands, it cannot be compared with either the 
June or the July figure for actual production, for there is a 
discrepancy of 15 days. This may be readily adjusted. 
The normal value for the year 1914, divided by 12, gives 
39.014 millions of tons as normal for the month centering 
at July 1, 1914. The difference between 39.014 and 38.025,
or .989, represents the difference in average monthly production,
due to trend, between a given month of one year
and the same month of the preceding year. The increment 
from one month to the next is equal to .989 divided by 
12, or .0824. This represents the increase from July 1, 1913, 
to August 1, 1913. The increase from July 1, 1913, to July 
15, 1913, is one half of this, or .0412. It has been shown that 
normal for the month centering at July 1, 1913, is 38.025 
millions of tons. Normal for the month centering at July 
15, 1913, is 38.025 + .0412, or 38.0662 millions of tons. 
Normal for the month centering at August 15, 1913, is 
38.0662 + .0824, or 38.1486. The trend value for the month
of September is 38.1486 + .0824, or 38.2310. The normal 
production for each of the months between July, 1913, and 
June, 1914, may be computed in a similar fashion. 
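
The interpolation just described may be sketched as follows, using the normal values for 1913 and 1914 given above. The function name is an assumption; the result differs from the printed figures only in the last decimal place, since the text rounds the monthly increment to .0824 before adding.

```python
def monthly_normals(normal_this_year, normal_next_year):
    """Twelve monthly normals, from July 15 of one year to June 15 of the
    next, by straight-line interpolation between two annual normals."""
    july1 = normal_this_year / 12.0                   # month centered at July 1
    step = (normal_next_year / 12.0 - july1) / 12.0   # increment per month
    july15 = july1 + step / 2.0                       # half-month correction
    return [july15 + step * k for k in range(12)]


normals = monthly_normals(456.30, 468.17)   # July, 1913 .. June, 1914
```

The first three entries correspond to the July, August and September normals worked out in the text.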

Were the trend of the series linear throughout, monthly 
normals for the entire period could be computed in this way 
by adding a constant increment to each monthly normal to 
secure the normal value for the following month. But in 
this series the rate of normal growth changes from year to 
year,¹ and the increment of normal value changes
correspondingly. Table 85 presents the normal or
trend figure for July of each year (centering at the 15th 
of each July), with the monthly increment of normal pro- 
duction for the various years, for the period 1912-1922, 
inclusive. 

This change in the rate of growth must be taken account 
of in computing the monthly normals. The process which 
has been explained in detail for the period July, 1913, to 
June, 1914, must be repeated for each twelve-month period. 

1 Strictly, the rate of normal growth changes from month to month. Interpolation
on the assumption of a linear trend for each twelve-month period involves no
appreciable error, however, and is much easier than with a more complicated
equation.

TABLE 85

Changes in the Rate of Normal Growth,
Bituminous Coal Production, 1912-1922

                                      Monthly Increment of
                 Normal               Normal Production for twelve months
  Date           Production           following the given date (in
                 (in millions of      millions of net tons)
                 net tons)

July, 1912          36.9341                   .0948
July, 1913          38.0662                   .0824
July, 1914          39.0490                   .0690
July, 1915          39.8680                   .0543
July, 1916          40.5120                   .0391
July, 1917          40.9742                   .0233
July, 1918          41.2461                   .0071
July, 1919          41.3238                  -.0090
July, 1920          41.2074                  -.0252
July, 1921          40.8970                  -.0409
July, 1922          40.3987                  -.0560

It remains to compute the indices of seasonal variation.
This may be done by any of the methods which have been
described above. In the present instance these indices were
computed by the method of moving averages, employing
monthly production figures for the period 1913-1921.¹ The
values are shown in Table 86.

The first step in the computation of cycles in bituminous 
coal production is the reduction of the actual monthly figures 
to percentages of the monthly normal or trend figures. The 
results are shown in column (4) of Table 86, which illustrates 
the process for two of the years covered in the study. These 
percentages are to be corrected for seasonal fluctuations, 
the indices of which are given in column (5). It is obvious 
that if the percentages in column (4) are less than the cor- 
responding percentages in column (5), the deviations of the 
actual figures from normal are negative, and if the reverse 
is true they are positive. The deviations in column (6) 
are secured by subtracting the indices of seasonal variation 
from the percentages in column (4). The items in column 
(6) are thus measures of the cyclical movements in bi- 
tuminous coal production, the effects of both secular trend 
and seasonal fluctuations having been eliminated. The
effects of accidental and irregular movements have not, of
course, been removed. The indices in column (6) are taken
to measure cyclical changes on the assumption that such
accidental changes are mutually offset in the long run.
(In the case of coal, this assumption is not always justified.)

1 The indices of seasonal variation in bituminous coal production which have
been employed in this problem were computed by the Statistical Department of the
Federal Reserve Bank of New York.
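
The correction for trend and seasonal variation amounts to one division and one subtraction for each month. The sketch below uses figures for January, 1913, and April, 1914, from Table 86; the function name is an assumption.

```python
def cyclical_deviation(actual, normal, seasonal_index):
    """Percentage deviation from normal, corrected for seasonal variation.

    The actual figure is expressed as a percentage of the trend value
    (column 4) and the seasonal index (column 5) is then subtracted.
    """
    percent_of_normal = 100.0 * actual / normal
    return percent_of_normal - seasonal_index


january_1913 = cyclical_deviation(42.274, 37.503, 107)
april_1914 = cyclical_deviation(23.609, 38.808, 83)
```

A positive result marks a month in which production stood above the level that trend and season alone would account for, and a negative result a month below that level.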


TABLE 86
Computation of Cycles in Bituminous Coal Production
1913-1914

    (1)              (2)             (3)            (4)          (5)           (6)
                                                                            Percentage
                   Actual          Normal        Actual as     Indices      Deviations
  Year and       Production      Production     Percentage       of         from normal
   Month        (in millions    (in millions    of Normal     Seasonal     corrected for
                of net tons)    of net tons)    (2) ÷ (3)     Variation      Seasonal
                                                                             Variation
                                                                             (4) - (5)
1913
  January          42.274          37.503         112.7          107          + 5.7
  February         37.057          37.598          98.6           92          + 6.6
  March            37.536          37.693          99.6          102          - 2.4
  April            34.169          37.788          90.4           83          + 7.4
  May              37.205          37.882          98.2           92          + 6.2
  June             37.405          37.977          98.5           95          + 3.5
  July             38.858          38.066         102.1           97          + 5.1
  August           41.590          38.149         109.0          106          + 3.0
  September        41.424          38.231         108.4          106          + 2.4
  October          46.164          38.313         120.5          113          + 7.5
  November         43.233          38.396         112.6          104          + 8.6
  December         41.519          38.478         107.9          103          + 4.9
1914
  January          40.191          38.561         104.2          107          - 2.8
  February         35.472          38.643          91.9           92          -  .1
  March            45.455          38.725         117.4          102          +15.4
  April            23.609          38.808          60.8           83          -22.2
  May              28.551          38.890          73.4           92          -18.6
  June             31.412          38.973          80.6           95          -14.4
  July             34.305          39.049          87.9           97          - 9.1
  August           37.751          39.118          96.5          106          - 9.5
  September        39.019          39.187          99.6          106          - 6.4
  October          37.685          39.256          96.0          113          -17.0
  November         33.392          39.325          84.9          104          -19.1
  December         35.862          39.394          91.0          103          -12.0


The calculations are shown in detail for only two years. 
When the same process is repeated for later years, the 
results presented in the following table are secured. The 
actual monthly data for the period 1913-1923 are shown, 
together with the percentage deviations from normal cor- 
rected for seasonal variation. 




TABLE 87

Bituminous Coal Production, 1913-1923
Actual Monthly Production and Corrected Deviations from Normal

(Actual production in millions of net tons; corrected deviations
from normal in percentages)

                     1913                    1914                    1915
               Actual    Corrected     Actual    Corrected     Actual    Corrected
Month        production  deviations  production  deviations  production  deviations
January        42.274      +5.7        40.191      -2.8        37.194     -12.7
February       37.057      +6.6        35.472      -.1         29.321     -17.8
March          37.536      -2.4        45.455     +15.4        31.801     -21.7
April          34.169      +7.4        23.609     -22.2        29.968      -7.5
May            37.205      +6.2        28.551     -18.6        30.938     -14.1
June           37.405      +3.5        31.412     -14.4        33.957      -9.7
July           38.858      +5.1        34.305      -9.1        35.573      -7.8
August         41.590      +3.0        37.751      -9.5        38.161     -10.4
September      41.424      +2.4        39.019      -6.4        40.964      -3.5
October        46.164      +7.5        37.685     -17.0        44.198      -2.6
November       43.233      +8.6        33.392     -19.1        44.737      +7.6
December       41.519      +4.9        35.862     -12.0        45.814     +11.1

                     1916                    1917                    1918
January        46.593      +8.9        47.969     +10.7        42.227      -4.3
February       45.187     +20.3        41.353      +9.4        43.777     +14.4
March          43.829      +6.8        47.869     +15.3        48.113     +14.9
April          33.628      +.3         41.854     +19.4        46.041     +28.8
May            38.804      +4.0        47.086     +23.1        50.443     +30.4
June           37.742      -1.7        46.824     +19.4        51.138     +29.0
July           38.113      -3.0        46.292     +16.0        54.971     +36.3
August         42.696      -.7         47.372      +9.5        55.114     +27.6
September      42.098      -2.3        45.108      +4.0        51.183     +18.0
October        44.807      -2.7        48.337      +4.8        52.300     +13.7
November       44.927      +6.5        47.690     +12.1        43.895      +2.3
December       44.098      +5.3        44.037      +4.2        40.184      -5.7

                     1919                    1920                    1921
January        42.193      -4.8        49.748     +13.5        41.148      -6.8
February       32.103     -14.3        41.055      +7.5        31.524     -15.2
March          34.293     -19.0        47.850     +14.0        31.055     -26.3
April          32.712      -3.8        38.764     +11.0        28.154     -14.3
May            38.186      +.4         39.841      +4.6        34.057      -8.8
June           37.685      -3.8        46.095     +16.8        34.635     -10.4
July           43.425      +8.1        45.988     +14.6        31.047     -21.1
August         43.613      -.4         49.974     +15.3        35.291     -19.6
September      48.209     +10.7        50.241     +16.1        35.893     -18.1
October        57.200     +25.5        53.278     +16.5        44.686      -3.4
November       19.006     -58.0        55.276     +30.5        36.805     -13.6
December       37.235     -12.8        53.257     +26.6        31.627     -25.3

                     1922                    1923
January        37.600     -14.5        50.123     +18.1
February       40.951      +8.8        42.160     +13.4
March          50.193     +21.7        46.807     +15.2
April          15.780     -44.1        42.564     +23.7
May            20.501     -41.4        46.076     +23.7
June           22.309     -39.8        45.490     +19.3
July           17.003     -54.9        45.644     +17.9
August         22.328     -50.7        48.864     +17.2
September      40.964      -4.3        46.216     +10.7
October        45.173      -.7         49.171    [illegible]
November       45.262      +8.7        42.946      +4.7
December       46.450     +12.8        39.707      -2.3


Fig. 64. Production of Bituminous Coal in the United States, 1913-1923 (by months)
(Vertical scale: actual production, in millions.)




The actual monthly production figures are plotted in 
Fig. 64 and the cycles (corrected deviations from normal) 
are plotted in Fig. 65. These cycles reflect with a fair 
degree of accuracy the cycles in general business, though 
in the case of bituminous coal accidental and irregular 
fluctuations, which are not eliminated by the process 
described above, are more important than in most economic 
series. Thus the low production in November, 1919, and 
during part of 1922 was due primarily to labor troubles in 
the bituminous fields. Account must be taken of such 
irregularities in analyzing any economic series.

If cyclical changes in this series are to be compared with 
similar changes in other series, it is desirable to reduce the 
figures to a form permitting such comparison. The percentage
deviations might be much more violent in one series
than in another, and without a common denominator com- 
parison would be difficult. This common denominator is 
afforded by the standard deviation. The monthly or annual 
deviations may be expressed in terms of the standard 
deviation as the unit of measurement, if such comparison 
is to be made. 
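
A minimal sketch of this reduction follows. It uses the twelve corrected deviations for 1913 merely as sample figures; in practice the standard deviation would be computed from the deviations of the whole period studied. The choice of the population form of the standard deviation, taken about the mean of the deviations, is an assumption made for the illustration.

```python
from statistics import pstdev

# Corrected deviations from normal for 1913 (Table 86), in percentages.
deviations = [5.7, 6.6, -2.4, 7.4, 6.2, 3.5, 5.1, 3.0, 2.4, 7.5, 8.6, 4.9]

# Standard deviation of the deviations (population form, an assumption here).
sigma = pstdev(deviations)

# Each deviation expressed in units of the standard deviation.
in_sigma_units = [d / sigma for d in deviations]

for d, u in zip(deviations, in_sigma_units):
    print(f"{d:+5.1f} per cent  =  {u:+5.2f} standard deviations")
```

Two series treated in this way are in the same unit, so that a movement of one standard deviation in coal production may be set beside a movement of one standard deviation in, say, bank clearings.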

The process of analysis has now been completed. We 
have, for the given series, the equation to the line of secular 
trend, and from this the normal or trend value at any given 
date may be computed. The seasonal variations have been 
measured, and indices of these variations computed. 
Finally, the cyclical fluctuations (plus the unmeasurable 
random and accidental changes) have been isolated. Further
analysis of these cyclical fluctuations, looking toward
the elimination of the random changes and the breaking 
up of the cyclical variations into simpler forms, is mathe- 
matically possible, but the greatest care is needed in apply- 
ing such refined methods to economic data. A discussion 
of the technique of this procedure is beyond the scope of 
the present work. 




Fig. 65. Cyclical and Accidental Fluctuations in Bituminous Coal Production in the United States, 1913-1923




GENERAL CONSIDERATIONS RELATING TO THE 
ANALYSIS OF SERIES IN TIME 


The following are certain general points which should be 
observed in the analysis of time series. Some of these have 
been noted above, and are here summarized. 


1. It is a fundamental requirement that the data be homogeneous 
for the period covered. This applies not only to external 
phases, to sources, methods of quotation, etc., but to under- 
lying economic and social conditions. Concretely, this 
bears directly upon the study of two of the factors described 
above: 

a. The representation of secular trend by a single curve is 
justified only if conditions affecting the trend have 
remained unchanged during the entire period covered. 

b. Indices of seasonal variation are to be computed only 
for periods marked by a fairly uniform seasonal fluctua- 
tion. This means that abnormal periods should not be 
included and that new indices should be constructed 
when underlying conditions change. The construction 
of such indices should be preceded by a careful pre- 
liminary study of the data and of the attendant factors 
in order to be sure that the conditions responsible for 
seasonal variation have undergone no change during 
the period covered. 

2. In deriving the equation to a line of trend, a period containing 
a whole number of cycles should be included. The years 
selected as terminal should be approximately normal with 
respect to the particular series being studied. 

3. For the determination of a line of trend and the calculation 
of indices of seasonal variation, data extending over as long 
a period as possible should be employed. Ten years may 
be suggested as the minimum period, though a much longer 
term of years is desirable. 

4. Time series expressed in terms of dollars yet representing 
physical quantities should, in general, be corrected for vari- 
ations in the purchasing power of the dollar before further 
analysis is attempted. Bank clearings, value of exports,
and value of building permits issued are examples of such
series. Care should be taken that the price index employed 
in making the correction is an appropriate one. 
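
The correction named in the fourth point may be sketched as follows. All of the figures are hypothetical, chosen only to show the arithmetic: a series of dollar values is divided by a price index (base = 100) to give values in dollars of constant purchasing power.

```python
# Hypothetical annual values of building permits (millions of dollars)
# and a hypothetical price index with 1913 = 100.
value_in_dollars = {1913: 80.0, 1914: 78.0, 1915: 90.0, 1916: 120.0}
price_index = {1913: 100.0, 1914: 98.0, 1915: 101.0, 1916: 127.0}


def deflate(values, index, base=100.0):
    """Express each dollar value in dollars of base-year purchasing power."""
    return {year: values[year] * base / index[year] for year in values}


deflated = deflate(value_in_dollars, price_index)
for year, v in deflated.items():
    print(year, round(v, 1))
```

The trend, seasonal and cyclical analysis would then be applied to the deflated figures rather than to the original dollar values.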


REFERENCES 


Chaddock, R. E. Principles and Methods of Statistics (Chap.
XIII).

Crum, W. L. The Use of the Median in Determining Seasonal
Variation. Journal of the American Statistical Association,
March, 1923.

Falkner, Helen D. The Measurement of Seasonal Variation.
Journal of the American Statistical Association, June, 1924.

Hall, Lincoln W. Seasonal Variation as a Relative of Secular
Trend. Journal of the American Statistical Association,
June, 1924.

Hart, W. L. The Method of Monthly Means for Determination
of a Seasonal Variation. Journal of the American Statistical
Association, Sept., 1922.

King, W. I. An Improved Method for Measuring the Seasonal
Factor. Journal of the American Statistical Association,
Sept., 1924. (In this article Dr. King proposes a method by
which year-to-year changes in the seasonal factor may be
measured.)

Moore, H. L. Economic Cycles: Their Law and Cause.

Moore, H. L. Generating Economic Cycles.

Persons, W. M. Correlation of Time Series. (In Handbook of
Mathematical Statistics, edited by H. L. Rietz, 151-158.)

Persons, W. M. Indices of Business Conditions. Review of
Economic Statistics. Prel. Vol. I, 1919.


CHAPTER IX 
INDEX NUMBERS OF PHYSICAL VOLUME 


Economists and business men who are awake to the 
larger implications of their activities have long lamented 
the dearth of production statistics in the United States. 
For a number of staple raw materials fairly accurate 
production figures are available, but for the great mass of 
manufactured commodities no reliable data are to be had. 
Recent activity by governmental departments and by 
private agencies is tending to correct this condition, and in 
the future accurate and comprehensive data on the course 
of production will undoubtedly be available. Such data are 
vitally necessary to the student of prices and to the student 
of business cycles, as well as to business men with particular 
interests and economists with a more general interest in the 
business system. 

The necessity of combining production figures in order to 
secure some sort of an average arises as soon as one’s interest 
ceases to be specific, confined to one or two individual 
commodities. He who would study the course of general 
production faces much the same problem as does the student 
of general price movements. If the general trend of production
is to be determined, or if the cyclical or seasonal
swings of production are to be studied, the mass of individual 
figures must be reduced to the form of a single index, the 
significance of which may be easily comprehended. In 1919 
such an index was compiled by the Price Section of the War 
Industries Board, under the direction of W. C. Mitchell. 
This index related to the production of 90 raw materials, 
and covered the period 1913-1918. More recently a number 
of similar index numbers, broader in scope and covering
longer periods, have been constructed. As illustrations
of method certain of these types may be briefly described. 


PRODUCTION INDEX OF THE WAR INDUSTRIES BOARD


The compilation of a production index by the Price Sec- 
tion of the War Industries Board was incidental to its major 
work in the field of prices. The method employed was 
similar to that followed in the construction of the price 
index. In the latter case the monthly and yearly prices of 
given commodities were multiplied by the amounts produced 
plus the amounts imported in 1917, the aggregate values 
secured by totaling these results being turned into price 
index numbers. The physical quantities were constant 
while prices varied. In compiling the production index 
the amounts of given commodities produced plus the 
amounts imported, year by year, were multiplied by 1917 
prices. Value aggregates were secured by years, and these 
were turned into relatives, which constitute production 
index numbers.¹ Prices remaining constant throughout,
the aggregate yearly values would be affected solely by 
changes in the amounts produced and imported. 

Three sets of aggregates were worked up for these 90 
raw materials, one representing yearly production times 
1917 prices, one representing yearly prices times 1917 
production, and the third yearly prices times yearly pro- 
duction. Upon the first of these the index of production 
was based, upon the second the index of prices, and upon 
the third an index of actual value changes, year by year, 
which represents the combined effect of changes in pro- 
duction and prices. These aggregates and index numbers 
are given in the accompanying table. 


1 The individual commodities were again weighted, the weights being proportionate
to the number of manufacturing operations employed in the fabricating
process. Thus raw mineral products were given more weight than raw farm
products, because many of the latter undergo no manufacturing process. The
weights employed were not explained in detail.




TABLE 88

Index Numbers of Production, Prices and Annual Values of
90 Raw Materials
1913-1918 ¹

         Aggregate Values (in millions of dollars)             Index Numbers
        Yearly production  Yearly prices  Yearly prices                          Prices
Year       times 1917       times 1917    times yearly    Production   Prices    times
             prices         production     production                          production
1913        $30,375          $19,973        $17,390          100        100       100
1914         30,207           19,224         16,694           99         96        96
1915         32,482           19,699         18,455          107         99       106
1916         33,700           23,363         22,785          111        117       131
1917         34,748           34,748         34,748          114        174       200
1918         35,169           38,251         39,153          116        192       225


The first index shows that the greatest increase in pro- 
duction came between 1914 and 1916, with much smaller 
increases in 1917 and 1918. The price increase is similar to 
that which has been revealed by other index numbers. 
The third column, in which both the production and price 
factors are variable, presents figures similar to those often 
quoted to show how production increased during the war. 
The increase in values was largely due to price changes, 
reflecting to only a small degree an actual increase in the 
physical volume of production. 
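
The index numbers of Table 88 may be recovered from the aggregates by a simple reduction to relatives on the 1913 base. The sketch below does no more than this; the variable names are introduced here for illustration.

```python
# Aggregate values from Table 88, in millions of dollars.
years = [1913, 1914, 1915, 1916, 1917, 1918]
aggregates = {
    "production": [30375, 30207, 32482, 33700, 34748, 35169],  # yearly q x 1917 p
    "prices":     [19973, 19224, 19699, 23363, 34748, 38251],  # yearly p x 1917 q
    "value":      [17390, 16694, 18455, 22785, 34748, 39153],  # yearly p x yearly q
}


def relatives(series, base_position=0):
    """Turn a series of aggregates into relatives, base year = 100."""
    base = series[base_position]
    return [round(100 * x / base) for x in series]


index_numbers = {name: relatives(series) for name, series in aggregates.items()}
for name, numbers in index_numbers.items():
    print(f"{name:<10}", numbers)
```

The production index rises only from 100 to 116, while the value index rises to 225, which is the contrast drawn in the paragraph above.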

The method employed above may be described by the 
following formula. For the index of changes in quantities 
produced in 1914, as compared with 1913 as base, we have 


$$\frac{\sum q_{1914}\, p_{1917}}{\sum q_{1913}\, p_{1917}}$$


This, it will be noted, is one of the aggregative types tested 
by Fisher in developing his “ideal” index except that the
p’s and the q’s are interchanged. This ideal index, and the 

1 From History of Prices During the War, Price Bulletin No. 1 of the War Industries
Board, 45. The original table gives separate index numbers for five
raw material groups: vegetable farm products, animal farm products, forest products,
mine products, and fishery products.




simplified forms recommended for ordinary use, are as 
appropriate for the measurement of quantity changes as 
for price changes. The ideal index, when used for this 
purpose, would take the form 


$$\sqrt{\frac{\sum (q_1 p_0)}{\sum (q_0 p_0)} \times \frac{\sum (q_1 p_1)}{\sum (q_0 p_1)}}$$


where $q_0$ and $p_0$ represent the quantities and prices of the
individual commodities in the base year, while $q_1$ and $p_1$
represent quantities and prices in the given year. The 
procedure in the computation of such an index is identical 
with that employed in computing the ideal price index with 
prices and quantities reversed. 
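
A minimal sketch of the ideal index applied to quantities is given below. The two commodities and their prices and quantities are hypothetical; only the formula itself is taken from the text.

```python
from math import sqrt


def aggregate(quantities, prices):
    """Sum of quantity times price over all commodities."""
    return sum(q * p for q, p in zip(quantities, prices))


def ideal_quantity_index(q0, p0, q1, p1):
    """Geometric mean of the base-weighted and given-year-weighted
    aggregative quantity indexes, expressed with base year = 100."""
    base_weighted = aggregate(q1, p0) / aggregate(q0, p0)
    given_weighted = aggregate(q1, p1) / aggregate(q0, p1)
    return 100 * sqrt(base_weighted * given_weighted)


# Hypothetical data for two commodities.
q0, p0 = [10.0, 40.0], [2.00, 0.50]   # base-year quantities and prices
q1, p1 = [12.0, 44.0], [3.00, 0.60]   # given-year quantities and prices

index = ideal_quantity_index(q0, p0, q1, p1)
print(round(index, 1))
```

If every quantity were doubled the index would stand at 200 whatever the prices, a property that may be used as a check upon the computation.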


DAY’S INDEX NUMBERS OF PRODUCTION


Edmund E. Day has compiled two of the most useful 
index numbers of the physical volume of production which 
are currently available. The two are constructed on 
different principles and are designed to serve somewhat 
different purposes. The characteristics of the one which is 
termed an “unadjusted index” of the volume of production
of basic materials in the United States may be first con- 
sidered. 

This index covers the period 1899 to date. Several 
changes have been made in the constituent series since the 
index was first constructed. The present description relates 
to those utilized for the period 1909-1922. 

In the compilation of this index available production 
data were divided into four groups, those relating to agri- 
cultural products, to products of animal husbandry, to 
forestry products and to mineral products. A separate 
index number has been computed for each of these groups, 
as well as for the entire division of basic materials. The
classification is based upon the fact that there are significant 
differences in the conditions affecting production in these 
four groups. 




The methods employed in the construction of the un- 
adjusted index are practically the same as those already 
explained in dealing with price index numbers. The 
individual production figures are first reduced to relatives, 
the figure for the year 1919 serving as base. (The average 
for the years 1919-1920 is taken as base in the case of 
agricultural products.) The index for each of the four minor 
groups is a weighted geometric average of these relatives 
(except in the case of forestry products, the index for which 
is based upon a single series, lumber cut in the United 
States). The weights are based upon the relative importance 
of the different commodities in each group, as measured by 
aggregate values in the base year. 
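
The weighted geometric average of relatives may be sketched as follows. The relatives and weights shown are hypothetical and stand for no actual commodities; the function name is likewise an assumption of this illustration.

```python
from math import exp, log


def weighted_geometric_mean(relatives, weights):
    """Weighted geometric average: the antilogarithm of the weighted
    arithmetic average of the logarithms of the relatives."""
    total_weight = sum(weights)
    mean_log = sum(w * log(r) for r, w in zip(relatives, weights)) / total_weight
    return exp(mean_log)


# Hypothetical production relatives (base year = 100) for three commodities,
# with weights proportionate to their base-year values.
relatives = [92.0, 105.0, 88.0]
weights = [5, 3, 2]

group_index = weighted_geometric_mean(relatives, weights)
print(round(group_index, 1))
```

The same function would serve for a further combination of several group index numbers into one, the group indexes taking the place of the relatives.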

The four group index numbers thus computed are based 
upon different numbers of commodities. For the period 
1909-1922 these numbers were as follows: 


Agricultural Products ............ 16
Products of Animal Husbandry ..... 8
Forestry Products ................ 1
Mineral Products ................. 13


(Figures for 16 mineral products were utilized for the later years of this period.) 


The final step is the combination of these group index 
numbers into a single index of the production of basic 
materials. This final index is a weighted geometric mean of 
the four group index numbers, weights being proportionate 
to the aggregate values of production in the four fields 
during the base year. 

By precisely similar methods an index of the volume of 
manufacture for the period 1914-1922 has been constructed 
by Day. Certain additional problems were faced in this 
field, particularly in the matter of weighting. For the 
period 1914-1918 each annual index number is a geomet- 
ric mean of two indices having, respectively, 1914 and 1919 
weights. For the rest of the period 1919 weights are used. 
This index is based upon 31 series representing commodities
produced or materials consumed in current manufacture.

The values of the various unadjusted indices are given 
in the following table: 


TABLE 89

Unadjusted Index Numbers of Production in the United States ¹
1909-1922

(1919 = 100)

                                                          Agriculture,
                                                          Forestry, Animal
        Agriculture               Animal                  Husbandry and
Year      (crops)     Forestry   Husbandry    Mining      Mining, Combined   Manufacture
1909        86.0        128.8       82.8        74.3            83.4
1910        88.5        115.8       75.0     [illegible]        82.6
1911        83.8        107.1       87.0        81.0            84.8
1912        99.6        113.3       84.3        86.7            91.7
1913        88.4     [illegible]    83.9        93.0            87.8
1914        99.5        108.1       80.3        87.6            90.1              75
1915       104.9         90.4       86.1        93.3            95.3              86
1916        91.6        100.7       93.6       104.8            94.6             102
1917        97.5         96.0       93.2       112.2            98.1             105
1918        98.1         84.9      103.2       112.7           102.0             102
1919       100.0        100.0      100.0       100.0           100.0             100
1920       109.2         86.4       92.5       114.2           103.0             104
1921        90.5     [illegible]    94.9        91.6            92.0              80
1922       100.7        100.7      102.1        97.5           100.7             109


A comparison of these indices yields interesting results. 
There are pronounced and characteristic differences in the
trends of the various series and there are equally marked 
differences in the fluctuations in production from year to 
year. As was pointed out in the discussion of index numbers 
of prices, the weather is generally the determining factor 
in agricultural production, while in other fields production 
is controlled with more immediate reference to existing 
prices and business conditions. 


1 These figures are from the Review of Economic Statistics, July, 1922, and 
July, 1923. 




THE ADJUSTED INDEX NUMBERS OF PRODUCTION


We have seen, in the analysis of time series, that the 
cyclical fluctuations in such series are often the objects of 
primary interest, particularly to one who is following 
changes in business conditions. This is particularly true
in the study of physical volume, for changes in the volume 
of production and trade are features of fundamental im- 
portance in the business cycle. Methods have been ex- 
plained, in the preceding chapters, by means of which we 
may measure such cyclical fluctuations in individual series. 
An obvious next step, in the study of general business con- 
ditions, is the combination of cyclical movements in a 
number of series into a single index. The utility of such an 
index of changes in the physical volume of production 
occurring in the course of the business cycle is evident. 

When annual data are employed the construction of an 
index of these cyclical changes is simple. No problem of 
seasonal variation enters, and secular trend alone has to be 
taken account of. Two different methods by which this 
may be done present themselves. Day, in computing his 
“adjusted” index of the physical volume of production,
has tested both methods. 

The first involves the fitting of an appropriate line of 
trend to each of the constituent series. The actual items 
are then expressed as percentages of the corresponding 
trend values. When this has been done for each series, the 
final adjusted index for a given year is obtained by taking
a weighted arithmetic average of these percentages for that 
year. Each commodity is given the same weight, in this 
averaging process, as in the calculation of the unadjusted 
index. The resulting adjusted index is in terms of relatives, 
but these relatives refer to a hypothetical “normal,”
instead of to any fixed base. This is the desired index of 
cyclical changes in the physical volume of production. 

The alternative method is simpler. Each unadjusted
index possesses a trend which is “a composite of the persistent
tendencies of the several original series upon which
the unadjusted index is based.” Why not measure this
trend, instead of the separate original trends, and secure 
the adjusted index directly from the unadjusted? 

The validity of the latter method may be tested by com- 
paring the results secured by the two methods. Day has 
done this and has found that the results are practically 
identical. Accordingly, the simpler process has been 
employed in securing the required index numbers. 
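
The two methods may be set side by side in a short sketch. Everything in it is an assumption made for illustration: the series are hypothetical, a straight line fitted by least squares stands for “an appropriate line of trend,” and the unadjusted index is formed as a weighted arithmetic mean for brevity, whereas the index described above is a weighted geometric mean.

```python
def linear_trend(values):
    """Ordinates of a least-squares straight line fitted to equally spaced items."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = sum(values) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    slope = sxy / sxx
    return [y_bar + slope * (x - x_bar) for x in range(n)]


def percent_of_trend(values):
    """Each item expressed as a percentage of the corresponding trend value."""
    return [100 * v / t for v, t in zip(values, linear_trend(values))]


def adjusted_from_series(series, weights):
    """First method: weighted arithmetic average of the percentages of
    trend of the separate constituent series."""
    pct = [percent_of_trend(s) for s in series]
    total = sum(weights)
    n_years = len(series[0])
    return [sum(w * p[i] for p, w in zip(pct, weights)) / total
            for i in range(n_years)]


def adjusted_from_index(unadjusted_index):
    """Second method: one trend fitted to the unadjusted index itself."""
    return percent_of_trend(unadjusted_index)


# Hypothetical relatives for two commodities over six years; weights 3 and 1.
series = [[80, 92, 88, 101, 110, 104], [95, 90, 104, 99, 112, 108]]
weights = [3, 1]
method_one = adjusted_from_series(series, weights)

# Hypothetical unadjusted index (weighted arithmetic mean, for brevity).
unadjusted = [(3 * a + b) / 4 for a, b in zip(*series)]
method_two = adjusted_from_index(unadjusted)

for first, second in zip(method_one, method_two):
    print(round(first, 1), round(second, 1))
```

Printing the two results together allows the same comparison that is reported in the text to be made upon any set of figures.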

The adjusted indices for agriculture and mining, corres- 
ponding to the unadjusted values given above, are presented 
in the following table: 


TABLE 90

Adjusted Index Numbers of Production in the United States ¹
1909-1922

Year     Agriculture     Mining
           (crops)
1909          98            97
1910          99           102
1911          93            98
1912         109           101
1913          95           104
1914         106            95
1915         110            98
1916          95           107
1917         100           111
1918          99           108
1919          99            93
1920         107           104
1921          88            81
1922          96            84


INDEX NUMBERS BASED ON MONTHLY VALUES


The preceding examples have dealt only with annual
index numbers. By a simple elaboration of the same
methods index numbers may be constructed from monthly
values of series relating to the physical volume of production
and trade. One of the most comprehensive of these
indices is that compiled and published monthly by the
Federal Reserve Board.


1 These index numbers are from the Review of Economic Statistics, July, 1922,
and July, 1923.


FEDERAL RESERVE BOARD INDEX OF PRODUCTION
IN BASIC INDUSTRIES


This index has been constructed for the period from 
January, 1913, to date, being published monthly in the 
Federal Reserve Bulletin. It is based upon 22 statistical 
series for which monthly data are currently available. 
Twenty of these series are listed in Table 91 below. In 
compiling the final index weights are given to the various 
series in accordance with their relative importance, as 
determined from the 1919 Census of Manufactures. Data 
on value added in manufacture and number of men employed 
are used in arriving at the weights. 

In the construction of this index no attempt is made to 
eliminate the effect of secular trend. The final index appears 
as a relative, with average monthly production in 1919 as 
base, but this base is not considered to be “normal” in
the sense in which ordinates to a line of trend are taken as 
“normal.” 

Before an average is secured the various series are reduced 
to relatives, on the 1919 base. Seasonal variations are 
eliminated so far as possible, in the calculation of these 
relatives, by correcting the 1919 monthly averages by 
indices of seasonal variation. The method by which this 
is accomplished is worthy of note.¹

The index numbers of seasonal variation are medians of 
monthly deviations from a twelve-month moving average, 
computed by a method which was explained in the preceding 
chapter. As considerable interest attaches to seasonal 
fluctuations in the series included, the indices secured are 
summarized in the following table: 


1 This method, in its particular application to this problem, is due to Frederick 
R. Macaulay, of the National Bureau of Economic Research. 


TABLE 91

Index Numbers of Seasonal Variation in Certain Statistical Series ¹

[The table is printed sideways in the original and its monthly index
values, January to December, are not legible in this copy. The series
listed are: pig iron production; production of steel ingots; cotton
consumption; wheat flour production; sugar meltings; hogs slaughtered;
cattle slaughtered; calves slaughtered; sheep slaughtered; lumber;
bituminous coal production; anthracite coal production; copper
production; sole leather production; newsprint production; cement
production; petroleum production; production of cigars; production of
cigarettes; production of manufactured tobacco.]

1 From the Federal Reserve Bulletin, Dec., 1922.




These index numbers, which are based upon data for the 
period 1913-19, inclusive, are quite significant in the light
they throw upon seasonal fluctuations in different industrial 
groups. 

The correction for seasonal variation is made in the 
figures used as bases for the various monthly relatives. 
The method may be illustrated by employing the figures 
for cotton consumption, as given in the Federal Reserve 
Bulletin. In the table which follows the actual figures for 
1919, the indices of seasonal variation and the monthly 
averages for 1919, adjusted for seasonal variation, are 
presented. 

TABLE 92

Application of Corrections for Seasonal Variation
(Cotton Consumption)

   (1)                (2)                  (3)                  (4)
               Actual consumption,      Indices of       Corrected monthly
Month                 1919               seasonal          averages, 1919
              (in hundreds of bales)     variation
January              5,569                 104                 5,130
February             4,333                  97                 4,785
March                4,335                 107                 5,278
April                4,759                 100                 4,933
May                  4,879                 105                 5,179
June                 4,743                 102                 5,032
July                 5,103                  99                 4,884
August               4,973                 100                 4,933
September            4,911                  95                 4,686
October              5,560                  99                 4,884
November             4,913                  95                 4,686
December             5,117                  97                 4,785

Total               59,195                                    59,195
Monthly average      4,933


The monthly average for 1919 is 4,933 (hundreds of 
bales). If there were no seasonal variation it would be 
correct to assume that this, apart from trend and accidental
fluctuations, would have been the actual production
in each month of that year. But in constructing the indices
of seasonal movement it has been found that January is, 
on the average, 4 per cent above the average for the year, 
February 3 per cent below, etc. Therefore, by multiplying
4,933 by the seasonal indices, we secure for the several 
months corrected monthly averages which indicate how the 
actual annual production in 1919 would have been distri- 
buted throughout the year if the force of seasonal variation 
were the only disturbing factor. These corrected monthly 
averages are used as the bases in the reduction of all actual 
figures to relatives. 

For example, actual cotton consumption in January, 
1921, amounted to 3,665 (hundreds of bales). To reduce 
this to a relative on the 1919 base, we divide this figure 
by 5,130, the corrected January figure for 1919, and multiply 
by 100. We secure as the relative for January, 1921, 
corrected for seasonal variation, the figure 71.4. To secure 
the February relative, the corrected February figure for 
1919 is used as base. It is a great advantage of this method 
that the base is adjusted once and for all for seasonal 
variation, no later corrections being necessary. 
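
The computation may be sketched in a few lines, using the cotton consumption figures quoted above. The names are introduced here for illustration only.

```python
# Monthly average of cotton consumption in 1919, in hundreds of bales.
MONTHLY_AVERAGE_1919 = 4933

# Indices of seasonal variation for cotton consumption (Table 92).
seasonal_index = {
    "January": 104, "February": 97, "March": 107, "April": 100,
    "May": 105, "June": 102, "July": 99, "August": 100,
    "September": 95, "October": 99, "November": 95, "December": 97,
}

# Corrected monthly averages: the 1919 average multiplied by each index.
corrected_base = {
    month: MONTHLY_AVERAGE_1919 * index / 100
    for month, index in seasonal_index.items()
}


def corrected_relative(actual, month):
    """Actual figure as a relative on the seasonally corrected 1919 base."""
    return 100 * actual / round(corrected_base[month])


# Actual consumption in January, 1921, was 3,665 (hundreds of bales).
january_1921 = corrected_relative(3665, "January")
print(round(corrected_base["January"]), round(january_1921, 1))
```

The dictionary of corrected bases is computed once; every later month is then reduced to a relative by a single division, which is the advantage claimed for the method.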

The final index for a given month is secured readily from 
corrected relatives such as this, representing all the con- 
stituent series. Each relative is multiplied by its weight, 
the products are added, and the total divided by the sum 
of the weights to give the desired index. The published 
figures include the corrected relatives, as well as the final 
index of all commodities, a practice which adds materially 
to the usefulness of this work. 
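
The weighted average described above may be sketched as follows. The relative for cotton is the one computed in the text; the other relatives and all of the weights are hypothetical figures chosen for illustration, not those used by the Federal Reserve Board.

```python
def weighted_index(relatives, weights):
    """Weighted arithmetic mean of corrected relatives."""
    total_weight = sum(weights[name] for name in relatives)
    weighted_sum = sum(relatives[name] * weights[name] for name in relatives)
    return weighted_sum / total_weight


# Corrected relatives for one month. Only the cotton figure comes from
# the text; the other two are hypothetical.
relatives = {"cotton": 71.4, "pig iron": 80.0, "lumber": 95.0}

# Hypothetical weights.
weights = {"cotton": 2, "pig iron": 5, "lumber": 3}

index = weighted_index(relatives, weights)
print(round(index, 2))
```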

In addition to this general index of production in basic 
industries, the Division of Analysis and Research of the 
Federal Reserve Board compiles and publishes a set of 
useful index numbers of activity in specific lines. These 
include an index of agricultural movements, an index of 
mineral production and an index of manufacturing production.
No correction for seasonal variation or secular trend is
attempted in the compilation of these index numbers. They
are available, by months, for the period beginning January, 1919.

[Fig. 66. General Business Activity. (Index compiled by Statistical Division, American Telephone and Telegraph Company)]


THE “GENERAL BUSINESS” INDEX OF THE AMERICAN TELEPHONE AND TELEGRAPH COMPANY


A somewhat different procedure is exemplified in the
composite index of general business conditions compiled
by the Statistical Division of the American Telephone and
Telegraph Company. This index has been computed by
months, going back to 1877, though its composition was
changed a number of times during this period.

In the construction of the index as it stood at the end
of 1923 eleven different series which were taken to reflect
general business activity were employed. Each of these
series was analyzed in the manner explained in the two
preceding chapters. Lines of secular trend were fitted, the
actual figures were expressed as percentages of the normal
values thus secured, and
these percentages were corrected for seasonal variation. A
further step had to be taken before the eleven series were 
combined into one. Certain series were marked by very 
much wider cyclical fluctuations than others, and before any 
averaging was attempted they all had to be reduced to 
comparable units. If this had not been done, the series 
having fluctuations of the greatest amplitude would have 
been over-weighted. This correction was made by express- 
ing the deviations from normal for each series in terms of 
the standard deviation of that series. 
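
The reduction to comparable units may be sketched as below. The two series are hypothetical, and the sketch assumes that the standard deviation is taken about the mean of each series of deviations; the text does not give the details of the Company's computation.

```python
from statistics import pstdev


def standardized(deviations):
    """Express each deviation from normal in units of the series' own
    standard deviation (root-mean-square deviation about the mean)."""
    sd = pstdev(deviations)
    return [d / sd for d in deviations]


def combine(series, weights):
    """Weighted average, month by month, of standardized series."""
    total = sum(weights.values())
    length = len(next(iter(series.values())))
    return [
        sum(weights[name] * series[name][t] for name in series) / total
        for t in range(length)
    ]


# Hypothetical percentage deviations from normal. Series "a" swings
# five times as widely as series "b".
raw = {"a": [-20, -10, 0, 10, 20], "b": [-4, -2, 0, 2, 4]}
weights = {"a": 20, "b": 10}  # hypothetical weights

units = {name: standardized(values) for name, values in raw.items()}
composite = combine(units, weights)
print([round(v, 3) for v in composite])
```

After standardization the two series have the same amplitude, so that the wider swings of series "a" no longer dominate the average; its influence is governed by its weight alone.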

The series which were combined into the final index, as 
of December, 1923, with their respective weights, were 
the following: 


Series                                        Weight
Pig iron production .......................     20
Net ton miles of freight ..................     10
Freight car loadings ......................     10
Steel ingot production ....................     10
Activity of wool machinery ................     10
Cotton consumption ........................     10
Paper production ..........................     10
Lumber production .........................      5
[illegible] production ....................      5
Bituminous coal production ................      5
Electric power production .................      5

Total                                          100


The values of this index, for the period 1877-1923 in- 
clusive, are given in the following table. These figures are 
plotted in Fig. 66. 




TABLE 93

Composite Index of General Business Activity, 1877-1923

Compiled by the American Telephone and Telegraph Company ¹

(The figures are in terms of percentage deviations from normal.)

            1877  1878  1879  1880  1881  1882  1883  1884  1885  1886  1887  1888
January       -9    -8   -18    +7   +12   +11    +9   -10   -16*  -10    -2    -2
February      -9    -9   -13    +9   +12   +11    +7    -9   -19    -8    +3    -2
March        -10   -10   -13   +11   +12   +12    +4    -8   -17    -6    +9    -7
April        -11   -11   -13   +12   +12   +12    +2    -8   -16    -8    +8    -6
May          -10   -11   -11   +10   +12   +10    +2    -7   -19    -7    +8    -2
June          -8   -12    -9    +8   +13    +8    +1    -6   -14    +3    +6    -2
July          -7   -12    -7    +7   +13    +5    +1    -6   -11    +4    -2   [?]
August        -7   -12    -3    +7   +13    +7    +1    -8   -15    +2    +2    +2
September     -7   -12    +2    +7   +12    +9    +2   -10   -14    +2    +7    +2
October       -7   -12    +6    +7   +12   +10    +2   -12   -12    +1    +5    +8
November      -7   -12    +6    +9   +12   +10    -2   -15   -11    +2    +7    +2
December      -8   -13    +7   +11   +11    +9    -6   -18    -9    +3    +1    +3

            1889  1890  1891  1892  1893  1894  1895  1896  1897  1898  1899  1900
January       +6   +11    +5    +9*  +12   -14    -3    -1   -13    -3    +1   +10
February      +5    +8    +1   +14   +12   -16    -9    -2   -11    -2    +1   +11
March         +3   +10    -5   +11   +12   -14   -10    -7   -12    -3    +4   +10
April         +1   +12    -6    +9   +11   -14   -11    -5   -12    -7    +1    +8
May           +4   +17    -7    +7   +11   -14    -8    -8   -15    -7    +2    +6
June          +2   +15     0    +9    +5   -20    -6    -8   -13    -4    +3    +6
July          +7   +16    +8    +6    -2   -18    -1    -8   -14    -8    +3    +3
August        +6   +12    +7    +6   -18   -10     0   -14   -10    -5    +6     0
September    [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]
October       +8   +17   +12    +8   -17    -7    +6   -17    -4    -7    +9    -3
November      +7    +8    +7   +10   -16    -6    +4   -16    -4    -4    +8    -4
December      +5    +5    +7   +12   -15    -4    +3   -13    -2    -2   +11    -4

            1901  1902  1903  1904  1905  1906  1907  1908  1909  1910  1911  1912
January       -1    +2    +3*  -10    -5   +10   +17   -14    -5*  +12     0    -2
February       0    +2    +1    -5    -5    +9   +14   -15    -5   +10    -1    +3
March        [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]
April         +5    +5    +4    -6    -1    +5   +16   -14    -5   +11   [?]    +6
May           +4    +5    +8    -8     0    +8   +18   -17   [?]    +6    -2    +5
June          +3    +2    +2   -10     0    +7   +16   -17    -2    +7   [?]   [?]
July          +4    +4    +3   -12    -1    +7   +18   -16    +1    +4    -2    +5
August        +4    +3    -1   -12     0    +8   +16   -13    +2    +4   [?]    +6
September      0    +6    -2    -8    +2    +5   +12   -11    +5    +4   [?]    +5
October      [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]   [?]
November      +2    +2   -11    -4    +4   +11    -4    -9    +9    +2    -1    +7
December      +1    +3   -14    -5    +7   +13   -12    -8   +10    +1    -2    +7

            1913  1914  1915  1916  1917  1918  1919  1920  1921  1922  1923
January      +10*    0   -16   +11   +17    +4    +4*  +13   -19*  -16    +8
February      +8    -2   -13   +12   +18   [?]    -8   +10   -18   -11    +8
March         +3     0   -12   +18   +14   +10    -8   +12   -22   -10   +10
April        [?]    -1    -9    +9   +12   +10    -6    +9   -25   -16*   +9
May           +5    -6    -8   +12   +14   +13    -2    +6   -25   -14   +10
June          +4    -4    -4   +11   +12   +10     0    +8   -23   -11    +7
July         [?]    -2    -8    +8   +10   +14    +7    +8   -25   -12    +6
August        +2   [?]    -2   +12   +10   +15    +9    +7   -21   -10    +5
September     +3   -10    +2   +13    +7   +11    +8    +5   -19    -6    +1
October       +5   -13    +5   +15    +9   +10    +7    -1   -18    -2    +1*
November      -1   -17    +9   +17   +11    +5    +4    -6   -15    +4     0
December      -2   -18   +14   +16    +6    +6    +8   -11   -14    +4    -2

* Denotes months in which composition of curve changes.
[?] Entry illegible in the source.

¹ This index is published here by courtesy of the American Telephone and
Telegraph Company.



AN INDEX OF THE VOLUME OF TRADE


By far the most comprehensive index of physical volume 
is that which has been constructed by the Reports Department
of the Federal Reserve Bank of New York.¹ Its
scope and character may be indicated by a list of the 
statistical series included. The weight assigned to each 
series is noted. 


TABLE 94

Statistical Series included in the Index of Trade
(Federal Reserve Bank of New York)

                                                        Weight
Productive Activity
   1. Consumers' Goods ...............................     8
   2. Producers' Goods ...............................     9
   3. Factory Employment .............................     6
   4. Motor Cars and Trucks ..........................     2
   5. Building Construction ..........................     4
                                                               29%
Primary Distribution
   6. Merchandise Car Loadings .......................   [?]
   7. Other Car Loadings .............................   [?]
   8. Wholesale Trade ................................   [?]
   9-12. [illegible in the source; one of these series is Imports]
                                                               22%
Distribution to Consumers
  13. Department Store Sales .........................   [?]
  14. Chain Store Sales ..............................   [?]
  15. Chain Grocery Sales ............................   [?]
  16. Mail Order Sales ...............................   [?]
  17. New Life Insurance .............................   [?]
  18. Amusement Receipts .............................   [?]
  19. Advertising ....................................   [?]
                                                               26%
General Business Activity
  20. Outside Debits .................................     8
  21. New York City Debits ...........................     5
  22. Postal Receipts ................................     1
  23. Communication ..................................     1
  24. [illegible] ....................................     2
                                                               17%
Financial Activity
  25. Shares Sold on New York Stock Exchange .........     2
  26. New Corporate Financing ........................     2
  27. Grain Future Sales in Chicago ..................     1
  28. Cotton Future Sales in New York and New Orleans      1
                                                                6%
                                                              100%

[?] Weight illegible in the source.

¹ For a detailed description of this index see “A New Index of the Volume of
Trade,” Carl Snyder, Journal of the American Statistical Association, Dec., 1923,
49-63.

Certain of these series are composites of more elementary 
data. Thus the series showing the production of consumers’ 
goods is based upon fifteen subordinate items, and the same 
number are employed in deriving the series relating to 
producers’ goods. 

Certain special problems were faced in deciding upon 
weights for certain of these series. In general the weights 
were based upon value added in manufacture, value of 
products, or value of commodities exchanged. 

In combining these series into a single index, methods 
already familiar to the student were employed. A “normal” 
was determined for each series by the fitting of an appropriate
line of trend. Seasonal indices were computed in all
cases in which a strong seasonal element was present. The 
problem of deflation was a particularly acute one, because 
of the many price series used. Appropriate deflating indices 
were readily available in some cases, but for many series 
deflating index numbers had to be constructed by a com- 
bination of several different series. Thus, for deflating 
debits to individual account outside New York, a deflating 
index was made up of the following constituents: 


Series                                                    Weight
Wholesale Commodity Prices ............................      2
[illegible] ...........................................      3
Index of Average Weekly Earnings of Employees .........      3
[illegible] ...........................................    [?]

Having made all these corrections in the individual series, 
the final combination was effected. The following values 
were secured for the index for the period 1919-1923: 


TABLE 95 


Index of the Volume of Trade, 1919-1923 
(Compiled by the Federal Reserve Bank of New York) 


                 1919   1920   1921   1922   1923
January .......    97    109     90     93    109
February ......    97    104     92     96    110
March .........    95    108     91    102    113
April .........   101    105     92     99    109
May ...........   106    104     91    103    110
June ..........   108    102     92    103    107
July ..........   110    101     91     97    100
August ........   107     99     95     99    102
September .....   107     96     94    103    100
October .......   109     95     92    103    106
November ......   105     94     92    104    105
December ......   105     93     93    105    100


The preceding pages have by no means given a complete
account of all the current index numbers measuring volume
of trade and production.¹ The primary purpose, however,
has been to exemplify methods, rather than to furnish a list
of available index numbers. While there is no one standard
method, it will be clear that the construction of quantity
index numbers requires no involved procedure. Certain
knotty problems (of deflation, of determining “normal,” of
measuring seasonal variations) are ever present, and further
experimentation is needed before these can be completely
solved. Existing methods are giving us, however, a series
of very valuable measures of business activity which are
not only materially increasing our knowledge of the economic
system but are, as well, making possible more effective
control of that system.

¹ Reference should be made to the Index of Trade for the United States,
covering the period from 1903 to date, by months, which has been compiled by
Warren M. Persons and which was described in the Review of Economic Statistics,
April, 1923; to the monthly index numbers of manufacture and mining published
in the Review of Economic Statistics; and to the comprehensive index numbers of
retail and wholesale trade published monthly in the Federal Reserve Bulletin.


REFERENCES

Day, Edmund E. An Index of the Physical Volume of Production.
Review of Economic Statistics, Sept., 1920, Jan., 1921.

Day, Edmund E. The Volume of Production of Basic Materials
in the United States. Review of Economic Statistics, July,
1922.

Persons, W. M. An Index of Trade for the United States. Review
of Economic Statistics, April, 1923.

Snyder, Carl. A New Index of the Volume of Trade. Journal
of the American Statistical Association, Dec., 1923.

Snyder, Carl. A New Clearings Index of Business for Fifty
Years. Journal of the American Statistical Association,
Sept., 1924.


CHAPTER X 


THE MEASUREMENT OF RELATIONSHIP: 
LINEAR CORRELATION 


In discussing averages and measures of dispersion and 
skewness we have been dealing with methods of describing 
a single frequency distribution. The arrangement of the 
values of a single variable along a scale may be portrayed 
by means of these measures, which enable the central value 
to be determined and the character of the distribution about 
that central value to be described. In the analysis of time 
series a somewhat different problem has been faced. In 
such cases we are concerned with the changing values of a 
variable factor with the passage of time, and seek to deter- 
mine the degree to which the changes in value are due to the 
play of different forces —the secular trend and cyclical, 
seasonal and accidental factors. The preceding chapters 
dealt with methods by which we might measure the effect 
upon a given series of each of these factors (with the excep- 
tion of accidental fluctuations). 

Certain of these methods are applicable to the problem 
now before us. It was found that in dealing with time series 
the relationship between time and the long term trend 
factor may be described by a definite mathematical equa- 
tion. That is, trend or growth seems to be a function of 
time for many economic series. Where such a relationship 
prevails, whether it hold precisely or only approximately, 
there is a distinct advantage in securing a mathematical 
expression which describes it. A similar but much broader 
problem is now to be discussed. If it is possible in dealing 
with time series to secure a definite mathematical equation
for the relation between time and the normal values of the
items in a given series, cannot the same device be employed 
in studying the relationship between other variables? Can 
we not measure, mathematically, the relation between 
cotton production and the price of cotton, between corn 
yield and rainfall, between earnings and the output of labor? 
If this can be done, it will place in the hands of the econo- 
mist a very powerful tool, giving his methods something 
of the precision which attaches to the work of the physical 
scientist. 


THE RELATION BETWEEN NUMBER OF TAXABLE PERSONAL
INCOMES AND MOTOR VEHICLE REGISTRATION


As a typical problem we may consider the relation between 
the number of taxable personal incomes and the number of 
motor vehicles registered, by states. The figures are given 
in columns (2) and (3) of Table 96.¹

These figures are plotted in Fig. 67, each dot representing 
the relation between the number of taxable incomes and 
the number of registered motor vehicles for a given state. 
Such a figure is termed a “scatter diagram.” It is clear 
from this diagram that there is a relationship between the 
two variables. In general, the states with a large number of 
taxable personal incomes are also those having a large 
number of motor vehicle registrations. The relationship, 
however, is not perfect. Two states with the same number 
of taxable incomes may differ quite widely in the number of 
registered vehicles. Thus both Kentucky and Oklahoma 
had 69,000 taxable personal incomes in 1921, yet the
former had 126,000 motor vehicles registered, while the 
latter had 221,000. Were the relationship perfect a single 
value of the Y-variable would always be paired with a 
single value of the X-variable. 


¹ The five leading industrial states (New York, Pennsylvania, New Jersey,
Illinois and Massachusetts) have not been included. The functional relationship
between personal incomes and motor vehicle ownership is not the same in industrial
communities as in the country at large, and there is, accordingly, a logical
justification for their exclusion.


TABLE 96

Taxable Personal Incomes and Motor Vehicle Registration in Forty-three
States, 1921

Columns:
(1) State
(2) No. of Taxable Personal Incomes, 1921 (in thousands), X
(3) No. of Motor Vehicles Registered, 1921 (in thousands), Y
(4) XY
(5) X²
(6) Y²

[The entries for the individual states are illegible in the source; only the column totals survive.]

               (2)        (3)         (4)          (5)          (6)
                X          Y          XY           X²           Y²
Total ....    3,602      7,616     1,215,966     603,968     2,612,066


Our first problem is the derivation of an equation to de- 
scribe this relationship which, while not perfect, is clearly 
existent. There is here a relationship analogous to a trend, 
and it is apparently a trend which can be represented by a 
straight line. The equation to a straight line, fitted by the 
method of least squares to the points on the scatter diagram, 
will express mathematically the average relationship between
these two variables. Such a line could, of course, be fitted
by inspection, but a more accurate result will be obtained
by the method of least squares.

[Fig. 67. Scatter Diagram Showing the Relation between Taxable Personal Incomes and Motor Vehicle Registration, by States, in 1921, with Line of Average Relationship. Horizontal axis: Number of Taxable Personal Incomes in 1921 (Thousands). Vertical axis: Motor Vehicles (Thousands).]


This calls for the solution of the following normal equations:

$$\Sigma(Y) = na + b\,\Sigma(X)$$
$$\Sigma(XY) = a\,\Sigma(X) + b\,\Sigma(X^2)$$

The values required for the solution of these equations may
be derived from the data as arranged in Table 96. Substituting,
we have

$$7{,}616 = 43a + 3{,}602b$$
$$1{,}215{,}966 = 3{,}602a + 603{,}968b$$

Solving,

$$a = 16.92, \qquad b = 1.91$$

The required equation is

$Y = 16.92 + 1.91X$ ¹


This line is plotted in Fig. 67. 
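
The solution of the two normal equations may be checked with a short computation. The sketch below uses the totals quoted in the text and the usual elimination formulas for two simultaneous linear equations; the variable names are introduced here for illustration only.

```python
# Totals taken from the text: n states, sum of X, sum of Y,
# sum of XY and sum of X squared (units of 1,000).
n = 43
sum_x = 3602
sum_y = 7616
sum_xy = 1215966
sum_x2 = 603968

# Solve  sum_y  = n*a     + b*sum_x
#        sum_xy = a*sum_x + b*sum_x2
# for the constants a and b.
determinant = n * sum_x2 - sum_x ** 2
b = (n * sum_xy - sum_x * sum_y) / determinant
a = (sum_y - b * sum_x) / n

print(round(a, 2), round(b, 2))
```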

A mathematical expression has now been secured for the 
relation between the two variables being studied, the number 
of taxable personal incomes, by states, and the number of 
motor vehicles registered. The former is the independent 
or X-variable in the equation, the latter the dependent or 
Y-variable. This equation constitutes a measure of the 
functional relationship between these two variables, but it 
is only an expression of average relationship. How significant 
is the equation? If the relationship were perfect, and the 
plotted points all lay on the line describing this relation- 
ship, the equation could be used with confidence as an 
accurate instrument for determining the value of one 
variable from a value of the other. But a line with a definite 
equation may be fitted to points which depart very widely 
from it, which are widely dispersed. In such a case the 
equation may have the appearance of describing a precise 
relationship but the variation is so great that it cannot 
be used with confidence. It is the same problem as that 
which arises when an average is employed. We must know 
how significant the average is, how great the concentration 

1 In the chapters on correlation capital letters (W, X, Y, etc.) are used to 
represent original values of the variable quantities, as measured from the zero 
points on the scales of actual values. Small letters (w, 2, y, etc.) are used to 


represent values of the variables expressed as deviations from their respective 
arithmetic means, 


368 THE MEASUREMENT OF RELATIONSHIP 


about it, before we may use it intelligently. So the equa- 
tion of relationship between variables means little unless 
we know to what extent it holds in practical experience. 
We must have a measure of the dispersion about the line 
we have fitted. 

In describing the frequency distribution, it has been 
found, the standard deviation is the best general measure 
of variation. It is, obviously, the measure we need in 
determining the reliability of the equation of average 
relationship. The standard deviation about this line will 
not only serve as a general index of the significance of this 
equation but will enable us to measure the degree of accuracy 
of estimates based upon the equation. 


THE COMPUTATION OF THE STANDARD ERROR


The standard deviation about a line of average rela- 
tionship, being a measure of the accuracy of estimates, may 
be termed the standard error of estimate, or, more briefly, 
the standard error. The term standard deviation is generally 
confined to the root-mean-square deviation about the 
arithmetic mean. The standard error is represented by 
the symbol S. 

In the computation of S we must know the normal value 
of Y which corresponds to each given value of X. By 
substituting the given values of X in the equation


Y = 16.92 +1.91X 


normal Y values may be computed. The deviations of the 
actual Y values from the normal may then be determined. 
The root-mean-square of these deviations is the required 
measure. A method of computation is illustrated in the 
following table: 


TABLE 97

(1)                       (2)                  (3)           (4)
State             Motor Vehicles Registered,  Y-computed       d
                  1921 (in thousands)                      (2) - (3)
                  Y-actual

Alabama ..........        82                    99           - 17
Arizona ..........        35                    51           - 16
Arkansas .........        67                    82           - 15
California .......       673                   755           - 82
Colorado .........       145                   151            - 6
Connecticut ......       137                   252          - 115
Delaware .........        21                    48           - 27
Florida ..........        97                    97              0
Georgia ..........       131                   147           - 16
Idaho ............        51                    61           - 10
Indiana ..........       400                   304           + 96
Iowa .............       460                   229          + 231
Kansas ...........       291                   187          + 104
Kentucky .........       126                   149           - 23
Louisiana ........        80                   147           - 67
Maine ............        77                   101           - 24
Maryland .........       140                   233           - 93
Michigan .........       477                   495           - 18
Minnesota ........       328                   256           + 72
Mississippi ......        65                    67            - 2
Missouri .........       346                   348            - 2
Montana ..........        58                    87           - 29
Nebraska .........       238                   155           + 83
Nevada ...........        10                    36           - 26
New Hampshire ....        42                    78           - 36
New Mexico .......        24                    40           - 16
N. Carolina ......       148                   101           + 47
N. Dakota ........        92                    51           + 41
Ohio .............       720                   719            + 1
Oklahoma .........       221                   149           + 72
Oregon ...........       118                   137           - 19
Rhode Island .....        54                   109           - 55
S. Carolina ......        90                    65           + 25
S. Dakota ........       119                    59           + 60
Tennessee ........       117                   134           - 17
Texas ............       467                   399           + 68
Utah .............        47                    67           - 20
Vermont ..........        36                    51           - 15
Virginia .........       141                   162           - 21
Washington .......       185                   239           - 54
West Virginia ....        93                   160           - 67
Wisconsin ........       341                   300           + 41
Wyoming ..........        26                    59           - 33

Sum of the squared deviations, Σd² ....................... 157,790


$$S_y = \sqrt{\frac{\Sigma d^2}{n}} = \sqrt{\frac{157{,}790}{43}} = 60.6$$


From the calculations we secure 60.6 as the value of S_y.
(The symbol S_y is used as this is the standard error of the
Y-variable.) This is to be interpreted in precisely the same
way as the standard deviation about an arithmetic mean.
Given an approximately normal distribution of items about
the line of relationship, 68 per cent of all the cases will lie
within a range of ± S (in this case 60.6), 95 per cent will
fall within ± 2S (in this case 121.2) and 99.7 per cent will fall
within ± 3S (in this case 181.8). If there were no scatter
about the line fitted to the points representing the cor- 
responding values of X and Y, S would have a value of 
zero, and the value of Y could be estimated from the value 
of X with perfect accuracy. The less the dispersion about 
the line, the smaller the value of S. The value of S serves, 
therefore, as an indicator of the significance and usefulness 
of the line which describes the relationship between the two 
variables. The standard error, it should be noted, is ex- 
pressed in the same units as the original Y-values. 
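
The computation of the standard error may be retraced as follows. The deviations are those of column (4) of Table 97; the variable names are hypothetical, and the count of cases lying within one standard error is added here merely as an illustration of the interpretation given above.

```python
from math import sqrt

# Deviations d = Y-actual minus Y-computed, in thousands of vehicles,
# copied in order from column (4) of Table 97.
deviations = [
    -17, -16, -15, -82, -6, -115, -27, 0, -16, -10, 96, 231, 104,
    -23, -67, -24, -93, -18, 72, -2, -2, -29, 83, -26, -36, -16,
    47, 41, 1, 72, -19, -55, 25, 60, -17, 68, -20, -15, -21, -54,
    -67, 41, -33,
]

sum_of_squares = sum(d * d for d in deviations)

# Root-mean-square deviation about the line of relationship.
standard_error = sqrt(sum_of_squares / len(deviations))

# Number of states whose deviation does not exceed one standard error.
within_one = sum(1 for d in deviations if abs(d) <= standard_error)

print(sum_of_squares, round(standard_error, 1), within_one)
```

The proportion of states within one standard error may be compared with the 68 per cent expected of a normal distribution, bearing in mind that the sample is small.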


THE MAKING OF ESTIMATES


We may, for a moment, consider the significance of these 
results. Let us assume that, not knowing the number of 
motor vehicles registered in a given state, we are under the 
necessity of estimating it. Two methods are open to us.
We may, in the first place, base the estimate upon our 
knowledge of the Y-variable alone. The total number of 
motor vehicles in the forty-three states included in the 
study is 7,616,000. Dividing this by 43 we have 177,116 
as the average. With no specific information as to the 
registration in a given state, the arithmetic mean of all 
the state figures would be taken as the most probable value 
for the state in question. (The most probable value of a 


LINEAR CORRELATION 371 


series of observations is the mean of the series.) How may 
we judge of the accuracy of this estimate? The standard 
deviation of the original distribution is a measure of the 
degree of variation about the mean and, therefore, a measure 
of the accuracy of an estimate based upon the mean. If 
the distribution approximates the normal type, the chances 
are 68 out of 100 that the true value for the state in question 
will not differ from the mean by more than the standard 
deviation. The standard deviation of motor vehicle registra- 
tion by states, as recorded in Table 96, is 171.4. The mean 
affords, therefore, a basis for a reasonable estimate, and 
the standard deviation affords some indication of the 
probabilities involved in making this estimate. (We are 
ignoring, of course, the possibility of basing estimates upon 
population, or similar factors.) 
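
The mean and standard deviation quoted here may be verified from totals alone. The sketch assumes that the total 2,612,066 shown in Table 96 is the sum of the squared Y-values, an identification which the result itself supports; the variable names are hypothetical.

```python
from math import sqrt

n = 43               # number of states
sum_y = 7616         # total registrations, in thousands
sum_y2 = 2612066     # assumed to be the sum of squared Y-values (Table 96)

mean_y = sum_y / n

# Standard deviation about the mean: root of (mean square minus squared mean).
sigma_y = sqrt(sum_y2 / n - mean_y ** 2)

print(round(mean_y, 3), round(sigma_y, 1))
```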

Another method of estimating the motor vehicle registra- 
tion in a given state is open to us if we know the number of 
taxable personal incomes in that state. We know, as a 
result of the study described in the preceding pages, that 
the average relationship between motor vehicle registration 
and number of taxable personal incomes is described by 
the equation 

Y = 16.92 +1.91X. 


(The unit is 1,000 for each variable, it will be recalled.) 


If a state has 200,000 taxable personal incomes, it may be 
estimated from this equation that there are 399,000 motor 
vehicles registered in that state. This is the most probable 
value of Y as determined from the equation of average 
relationship. Is this estimate any better than the previous 
one, which took the mean Y as the most probable value? 
Does our knowledge of the average relationship between 
X and Y aid us in estimating the value of Y from a known
value of X? 

The answers to these questions are given by the standard
error of Y, and by the relationship between the standard
error of Y and the standard deviation of Y. The standard
error of Y (that is, the standard deviation about the
line of average relationship) is 60.6. The standard devia- 
tion of Y is 171.4. Clearly the estimate made from the 
equation is more accurate than the estimate based upon 
the value of the mean Y. In the former case the odds are 
68 out of 100 that the error will not exceed 60.6 or, in 
terms of the original units, 60,600 vehicles. When the es- 
timate is made from the mean, the odds are 68 out of 100 
that the error will not exceed 171,400 vehicles.¹ From our
knowledge of the relationship between the two variables, 
even though that relationship is by no means constant or 
perfect, we are able to reduce materially the errors of 
estimate. 
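
The two methods of estimate may be set side by side in a short computation. The constants are those derived in the text; the function name and the choice of a state with 200,000 taxable incomes follow the example above, and the whole is offered as a sketch rather than as a finished procedure.

```python
# Constants from the text (units of 1,000).
a, b = 16.92, 1.91           # equation of average relationship
standard_error = 60.6        # S_y, scatter about the line
mean_y = 177.116             # mean registration per state
standard_deviation = 171.4   # sigma_y, scatter about the mean


def estimate_from_line(x):
    """Most probable Y for a given X, from the equation of relationship."""
    return a + b * x


x = 200  # a state with 200,000 taxable personal incomes
estimate = estimate_from_line(x)

# Ranges expected to hold in about 68 cases out of 100, given
# approximately normal distributions.
range_from_line = (estimate - standard_error, estimate + standard_error)
range_from_mean = (mean_y - standard_deviation, mean_y + standard_deviation)

print(round(estimate), range_from_line, range_from_mean)
```

The range about the line is roughly one third as wide as the range about the mean, which is the gain in accuracy described in the text.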


THE COEFFICIENT OF CORRELATION 


We have now secured two measures which aid us in de- 
scribing the relationship between variable quantities. The 
first is the fundamental equation of relationship, the ex- 
pression of the degree of change in one variable associated, 
on the average, with a given change in the other. The 
second is the standard error, the measure of the degree of 
“scatter” about the line of average relationship. The
standard error resembles the standard deviation in that it is
a measure expressed in absolute terms, in the units employed
in measuring the original Y-values. This measure enables
us to determine in a given case the probability that an
estimate based upon the equation of relationship will fall 
within certain limits. 

In measuring variation it has been found that an abstract
measure of variability is needed, one which is divorced
from the absolute terms of the given problem. Such a
measure is particularly needed, it was noted, when different
distributions are to be compared. So, for measuring the
degree of variability, a coefficient of variation is employed.
There is need of a somewhat similar measure in connection
with our present problem. We need a measure of the
degree of relationship between two variables, an abstract
coefficient which is divorced from the particular units
employed in a given case. Karl Pearson has developed such
a coefficient.

¹ In the present case, with a limited number of items and distributions which
depart quite widely from the normal type, the precise probabilities cannot be so
accurately determined from the values of S_y and σ_y. In particular, the presence
of six or eight very large items, the rest being small in comparison, affects these two
measures materially. With this qualification in the matter of interpretation we
may use S_y and σ_y as useful measures of dispersion.

This measure may be explained in terms of the preceding 
discussion. It was found that the usefulness of estimates 
based upon the equation of relationship could be determined 
by comparing the standard error of Y (the measure of scatter 
about the line of relationship) with the standard deviation 
of Y. If the standard error be as great as the standard de- 
viation the equation of relationship is of no use to us, but if 
the standard error be less than the standard deviation the 
accuracy of estimates may be improved by using this 
equation. The significance of the equation is thus indicated 
by the relation between the standard error and the standard 
deviation. But these are both in absolute terms, so that by
dividing one by the other an abstract measure may be
secured. Thus we might write

$$\text{Measure of correlation} = \frac{S_y}{\sigma_y}$$

A somewhat more useful measure is secured by putting the
ratio in this form:

$$\text{Measure of correlation} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

This measure, when used in connection with a linear equation,
is called the coefficient of correlation and is represented
by the symbol r.

A brief consideration of this formula will help to make
clear the significance of r. If there is no dispersion about
the line of relationship, S_y will have a value of zero; the
equation describes a perfect relationship between the two
variables. In this case, as is clear from the formula, r must
have a value of 1.

The maximum value of S_y is one which is equal to σ_y.
Under these conditions, when the equation of relationship is 
of no aid in improving our estimates, the formula will give 
zero as the value of r. Such a value indicates that there is 
no relationship between the two variables; in other words, 
that the straight line of best fit is horizontal, passing through 
the mean of the Y’s. It shows that there is no tendency for 
the high values of Y to be associated with high values of X 
or for high values of Y to be associated with low values of 
X. The two variables fluctuate in absolute independence. 
In such a case the deviation of each point from the fitted 
line is equal to its deviation from the mean, and the two 
root-mean-square deviations are equal, as stated. 

Zero and unity are thus the limits to the value of r. The 
values found in practical work fall somewhere between these 
limits, approaching unity in cases where the degree of 
relationship is high. The greater the value of r, the greater
the confidence that may be placed in the equation as an
expression of a relation which is approximated in a high
percentage of cases. In the example presented above,
dealing with motor vehicle registration and number of
taxable personal incomes, we have

$$r = \sqrt{1 - \frac{(60.6)^2}{(171.4)^2}} = .935$$

This value indicates a definite and close connection between 
these two variables for the states included in the sample. 
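
The arithmetic of this formula may be checked with a short sketch in Python. The function name `correlation_from_scatter` is an illustrative choice and not part of the text; the figures 60.6 and 171.4 are those of the example above, and the two limiting cases follow the discussion of zero and unity.

```python
import math


def correlation_from_scatter(s_y: float, sigma_y: float) -> float:
    """Return r = sqrt(1 - S_y^2 / sigma_y^2) for a straight-line fit."""
    if sigma_y <= 0:
        raise ValueError("sigma_y must be positive")
    ratio = (s_y / sigma_y) ** 2
    # For a least-squares line S_y cannot exceed sigma_y; the max() only
    # guards against a tiny negative value produced by rounding.
    return math.sqrt(max(0.0, 1.0 - ratio))


# Motor vehicle registration and taxable personal incomes (example above).
r_example = correlation_from_scatter(60.6, 171.4)
# No scatter about the line: perfect relationship.
r_perfect = correlation_from_scatter(0.0, 171.4)
# Scatter about the line as great as the scatter about the mean.
r_none = correlation_from_scatter(171.4, 171.4)

print(round(r_example, 3), r_perfect, r_none)
```

The sign of r is not supplied by this formula; it is taken from the slope of the fitted line, as the next paragraph explains.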
The coefficient of correlation may be made somewhat 
more significant by giving it the sign of the constant b in 


1 The high value of r in this case is in part due to the presence of six or eight 
items which are much larger than the main group. The effect of such exceptional 
cases upon the coefficient of correlation is discussed in more detail below. 




the equation of relationship. This sign indicates whether 
the slope of the line is positive or negative and, when 
attached to r, enables us to tell whether the relationship 
is direct or inverse. Thus in the present case high values of
one variable are paired with high values of the other. The
correlation is positive and the coefficient should be written 
+ .935. When cotton production and prices are correlated 
the relationship is an inverse one, for high values of one 
variable are generally associated with low values of the 
other. 

The measurement of relationship in a given case is com- 
pleted when we have secured the three measures described. 
The equation of average relationship is an expression of the 
underlying law connecting the two variables, if such a law 
may be assumed. The standard error measures the varia- 
tion, in absolute terms, about the line of relationship. The 
coefficient of correlation is an abstract measure of the degree 
to which the average relationship actually holds in practice. 


DETAILS OF CALCULATION 


In the preceding section the attempt has been made to 
explain the various measures necessary in studying the 
relationship between variable quantities without introducing 
a detailed explanation of procedure. We may now return 
to a consideration of the details of calculation, including 
certain methods by which this calculation may be reduced 
to a minimum. 

The procedure followed in the preceding illustration is a 
logical one to employ in deriving the three required values. 
This method is capable of general application, but the labor 
involved may be materially reduced by taking advantage of 
a short-cut method of deriving $S_y$. This method may be
first explained with reference to data of the type dealt with 
above. And, for the present, the discussion will be confined 
to cases in which the relationship between variables may be 
described by a straight line. 


The first problem is the derivation of the equation of
relationship. A line of the type

$$Y = a + bX$$


is fitted by the method of least squares. 

The next step is the computation of $S_y^2$, the square of the
standard error. This was done in the above illustration by
measuring the deviation of each individual observation from
the fitted line, and getting the mean-square of these deviations.
It may be shown¹ that this value can be derived
from the following equation:


1 The standard error is computed from the formula

$$S_y^2 = \frac{\Sigma(d^2)}{N}$$

where d represents a single deviation from the fitted line, or the difference between
the actual and the computed value of Y in a given case. The latter is derived from
the equation

$$Y_c = a + bX.$$

(The symbol $Y_c$ is used to represent the computed value of Y.)
If we let Y represent the actual value we have, for each residual,

$$d = Y_c - Y$$

or

$$d = a + bX - Y. \qquad (1)$$

There will be as many equations of this type as there are points. Multiplying
each one by d, and adding, we have

$$\Sigma(d^2) = a\Sigma(d) + b\Sigma(dX) - \Sigma(dY). \qquad (2)$$

But, since the line was fitted by the method of least squares,

$$\Sigma(d) = 0, \qquad \Sigma(dX) = 0$$

(For proof of this see Appendix A.)
and, therefore,

$$\Sigma(d^2) = -\Sigma(dY). \qquad (3)$$

Returning again to equation (1), we may multiply throughout by Y, and add,
securing

$$\Sigma(dY) = a\Sigma(Y) + b\Sigma(XY) - \Sigma(Y^2). \qquad (4)$$

Substituting the equivalent of $\Sigma(dY)$ in equation (3) we have

$$\Sigma(d^2) = \Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) \qquad (5)$$

from which the given formula for $S_y^2$ is derived.


$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{N}$$


The quantities a and b are the constants in the equation to 
the fitted straight line. The other values relate to the 
original observations. Substituting in this equation the 
necessary values, taken from Table 96 and from page 367 
above, we have¹

$$S_y^2 = \frac{2,612,066 - (16.92 \times 7,616) - (1.91238 \times 1,215,966)}{43} = \frac{157,814}{43} = 3,670$$

$$S_y = 60.6$$
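
The short-cut may be sketched in Python as follows. The function name `standard_error_from_sums` and its argument names are assumptions made for illustration; the sums and the constants a and b are those quoted in the substitution above.

```python
import math


def standard_error_from_sums(n, sum_y2, sum_y, sum_xy, a, b):
    """Short-cut S_y: sqrt of (sum(Y^2) - a*sum(Y) - b*sum(XY)) / N.

    Valid only when a and b are the least-squares constants of
    the line Y = a + bX fitted to the same observations.
    """
    residual_sum_of_squares = sum_y2 - a * sum_y - b * sum_xy
    return math.sqrt(residual_sum_of_squares / n)


# Values quoted in the text (Table 96 and the fitted line).
s_y = standard_error_from_sums(
    n=43,
    sum_y2=2_612_066,
    sum_y=7_616,
    sum_xy=1_215_966,
    a=16.92,
    b=1.91238,
)
print(round(s_y, 1))
```

Because the residual sum is a small difference between large quantities, the constants a and b should be carried to several decimal places, which is the point of the footnote attached to this calculation.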


From this point the procedure may follow that already 
described, r being computed from the formula 


$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

The coefficient r may be secured, however, without computing
$S_y$ as an intermediate value. The above formula for
r may be reduced to

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

where $c_y$ is the difference between the mean Y and the
origin employed in the calculations.² If the origin is zero


1 For this calculation the value of b is given to a greater number of decimal
places than in the equation as first presented.
2 The formula

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

may be written

$$r^2 = 1 - \frac{\Sigma(d^2)}{\Sigma(y^2)}$$

in which y refers to deviations from the arithmetic mean of the Y's. But

$$\frac{\Sigma(y^2)}{N} = \frac{\Sigma(Y^2)}{N} - c_y^2$$

where Y represents a deviation from an arbitrary origin (in this case zero on the


on the original Y scale, $c_y$ will be equal to the arithmetic
mean of the Y's.
In the present case, using the data of Table 96, we have

$$c_y = \frac{7,616}{43} = 177.116.$$

The other values are the same as those employed above in
computing $S_y$. Substituting in the formula, we have

$$r^2 = \frac{1,105,337.98}{1,263,152.7} = .875$$

$$r = .935$$
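
A sketch of the same reduction in Python is given below. The function name `r_squared_from_sums` is hypothetical. Since the sketch carries $c_y$ at full precision rather than at three decimals, its numerator and denominator differ slightly from the printed 1,105,337.98 and 1,263,152.7, but the quotient agrees to three places.

```python
import math


def r_squared_from_sums(n, sum_y2, sum_y, sum_xy, a, b):
    """r^2 = (a*sum(Y) + b*sum(XY) - N*c_y^2) / (sum(Y^2) - N*c_y^2).

    c_y is the mean of Y measured from the origin used for the sums
    (here zero on the original scale, so c_y is simply the mean of Y).
    """
    c_y = sum_y / n
    correction = n * c_y ** 2
    return (a * sum_y + b * sum_xy - correction) / (sum_y2 - correction)


r_squared = r_squared_from_sums(
    n=43,
    sum_y2=2_612_066,
    sum_y=7_616,
    sum_xy=1_215_966,
    a=16.92,
    b=1.91238,
)
r = math.sqrt(r_squared)
print(round(r_squared, 3), round(r, 3))
```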


In effect, then, the labor of fitting a straight line by the
method of least squares gives us practically all the values
needed in securing $S_y$ and r, the two other measures necessary
for a complete description of the relationship between two
variable quantities. The only additional values required
are $\Sigma(Y^2)$ and $c_y$.


THE CONSTRUCTION OF A CORRELATION TABLE


In the example presented above we had only forty-three 
observations. With a larger number it becomes practically 
impossible to retain the individual values in the study of 
relationships. These individual items must be grouped in 


original scale) and $c_y$ represents the difference between this origin and the mean
of the Y's.
Therefore

$$r^2 = 1 - \frac{\Sigma(d^2)}{\Sigma(Y^2) - Nc_y^2}.$$

Substituting in this equation the equivalent of $\Sigma(d^2)$, as given in equation (5)
of the footnote on page 376,

$$r^2 = 1 - \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{\Sigma(Y^2) - Nc_y^2}.$$

Simplifying,

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}.$$



significant classes, and all computations must be based upon 
these grouped data. This means, merely, that we must 
handle data organized in frequency distributions. Since 
we are dealing with two variables, however, the simple 
frequency table must be modified to meet the needs of the 
present problem. Such a modified frequency table, arranged 
to facilitate the computation of the values needed in studying 
relationship, is termed a correlation table. 

As a typical problem, involving the construction of such 
a table, we may consider the relation between the discount 
rates of Federal Reserve banks on a standard type of com- 
mercial paper and the discount rates of commercial banks 
upon similar paper. Since this paper may be rediscounted 
at the Federal Reserve banks by the member banks, some 
degree of relationship between the rates may be expected. 
Our present object is the measurement of that relationship. 

The first step is the tabulation of the original observations.
Monthly values of each variable¹ were secured for
each of the twelve Federal Reserve cities over a period of 
42 months, from July 1, 1920, to December 1, 1923. In 
the process of tabulation the items must be combined so 
that a Federal Reserve bank discount rate is paired with 
the corresponding rate charged by the commercial banks 
of the same city. Fig. 68 illustrates the method of tabula- 
tion. 

1 The discount rates of the Federal Reserve bank relate to trade acceptances 
maturing within 90 days. The commercial bank rates are those charged on cus- 
tomers’ prime commercial paper maturing in from 30 to 60 days (in some cases 


from 30 to 90 days). The customary rate over a given 30 day period was taken 
as of the middle of that period. 


Fig. 68. — Tabulation of Items in a Correlation Table

Tabulation having been completed, a correlation table 
designed to facilitate the later computations may be con- 
structed. This is shown below. 


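TABLE 98

Correlation Table, Discount Rates of Federal Reserve Banks
and Discount Rates of Commercial Banks

X = Federal Reserve Banks' Discount Rates (per cent), by columns
Y = Discount Rates of Commercial Banks (per cent), by rows
X' and Y' = deviations from the arbitrary origin, in class-interval units
Entries are frequencies; a dash marks an empty compartment.

X class-interval              3.75-  4.25-  4.75-  5.25-  5.75-  6.25-  6.75-
                              4.24   4.74   5.24   5.74   6.24   6.74   7.24
X mid-point                   4.00   4.50   5.00   5.50   6.00   6.50   7.00
X'                               0      1      2      3      4      5      6  Totals

Y class-interval  Mid-point  Y'
7.75-8.24         8.00       8     -      -      -      -      1      1      2       4
7.25-7.74         7.50       7     -      -      -      -      7      9      1      17
6.75-7.24         7.00       6     -      -      5      4     63      9     36     117
6.25-6.74         6.50       5     -      2      9     10     22      1      3      47
5.75-6.24         6.00       4     1     90     29      6     30      -      -     156
5.25-5.74         5.50       3    11    110      5      -      -      -      -     126
4.75-5.24         5.00       2    10     24      -      -      -      -      -      34
4.25-4.74         4.50       1     2      1      -      -      -      -      -       3

Totals                            24    227     48     20    123     20     42     504

Sums, in class-interval units: $\Sigma(X') = 1,227$; $\Sigma(X')^2 = 4,579$;
$\Sigma(Y') = 2,161$; $\Sigma(Y')^2 = 10,245$.
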


In this table, it will be noted, an arbitrary origin is em- 
ployed for each variable, and the class-interval unit is used 
in the calculations. We may employ the symbols X’ and 
Y’ to represent the deviations from the arbitrary origin 
(which is located at point 4,4 on the original scales). 


COMPUTATION OF MEASURES OF RELATIONSHIP


From this correlation table all the values needed in fitting 
a straight line to the data, and in computing the measures 
$S_y$ and r, may be derived. The quantities $\Sigma(X')$, $\Sigma(X')^2$,
$\Sigma(Y')$ and $\Sigma(Y')^2$ are computed by methods already familiar
to the student. The sum of the products of the paired values, $\Sigma(X'Y')$,
may be computed directly from the correlation table, but
it is perhaps simpler for the beginner to re-arrange the data
in columnar form, as in the following table. When the
figures are disposed in this way one line is employed for each
compartment of the original correlation table in which
items have been recorded.


TABLE 99

Discount Rates of Federal Reserve Banks and
Discount Rates of Commercial Banks

Computation of Values Required in Curve Fitting

 (1)    (2)    (3)      (4)
  X'     Y'     f     f(X'Y')
  0      1      2         0
  0      2     10         0
  0      3     11         0
  0      4      1         0
  1      1      1         1
  1      2     24        48
  1      3    110       330
  1      4     90       360
  1      5      2        10
  2      3      5        30
  2      4     29       232
  2      5      9        90
  2      6      5        60
  3      4      6        72
  3      5     10       150
  3      6      4        72
  4      4     30       480
  4      5     22       440
  4      6     63     1,512
  4      7      7       196
  4      8      1        32
  5      5      1        25
  5      6      9       270
  5      7      9       315
  5      8      1        40
  6      5      3        90
  6      6     36     1,296
  6      7      1        42
  6      8      2        96


The values required in fitting a straight line and in computing
the standard error and the coefficient of correlation
are:

$N = 504$                    $\Sigma(X')^2 = 4,579$
$\Sigma(X') = 1,227$         $\Sigma(X'Y') = 6,289$
$\Sigma(Y') = 2,161$         $\Sigma(Y')^2 = 10,245$

The equation to the best fitting straight line is found to be

$$Y' = 2.7155 + .6458X'.$$

Substituting in the formula

$$S_y^2 = \frac{\Sigma(Y')^2 - a\Sigma(Y') - b\Sigma(X'Y')}{N}$$

we have

$$S_y^2 = \frac{10,245 - (2.7155 \times 2,161) - (.6458 \times 6,289)}{504} = .6257$$

$$S_y = .791$$

To determine the value of the coefficient of correlation
we have only to substitute the proper values in the equation

$$r^2 = \frac{a\Sigma(Y') + b\Sigma(X'Y') - Nc_y^2}{\Sigma(Y')^2 - Nc_y^2}.$$

When this is done we have

$$r^2 = \frac{(2.7155 \times 2,161) + (.6458 \times 6,289) - (504 \times 18.38437)}{10,245 - (504 \times 18.38437)} = \frac{663.9}{979.27} = .67795$$

$$r = +.82$$

All these calculations have been carried through in class- 
interval units, with reference to an origin at point 4, 4 on 
the original scales. The value of r is not affected by this 
fact, but the estimating equation and the standard error 
should be corrected. 

The value of $S_y$, in class-interval units, is .791. Since
the class interval of the Y-variable is .5%, we have, in
original units,

$$S_y = .5\% \times .791 = .40\%.$$

The equation may be corrected in a similar fashion. The
class interval being .5% for both X and Y, we may write

$$2Y' = 2.7155 + .6458(2X')$$

or

$$Y' = 1.3577 + .6458X'$$

which is the equation in terms of original units.

In changing the origin, we know that $Y' = Y - 4$ and
$X' = X - 4$. Therefore

$$Y - 4 = 1.3577 + .6458(X - 4)$$

or

$$Y = 2.7745 + .6458X.$$
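
The two corrections (for the class-interval unit and for the arbitrary origin) can be combined in one step, as in the following sketch. The function `to_original_units` and its parameter names are assumptions introduced for illustration; it allows for unequal class intervals, although in the present case both are .5.

```python
def to_original_units(a, b, x_origin, y_origin, x_interval, y_interval):
    """Convert Y' = a + bX' (class-interval units, arbitrary origin)
    into Y = intercept + slope * X on the original scales.

    With X' = (X - x_origin) / x_interval and Y' = (Y - y_origin) / y_interval,
    Y = y_origin + y_interval * a + (b * y_interval / x_interval) * (X - x_origin).
    """
    slope = b * y_interval / x_interval
    intercept = y_origin + y_interval * a - slope * x_origin
    return intercept, slope


intercept, slope = to_original_units(
    a=2.7155, b=0.6458,
    x_origin=4.0, y_origin=4.0,
    x_interval=0.5, y_interval=0.5,
)
# The standard error needs only the change of unit, not of origin.
s_y_original = 0.5 * 0.791

print(round(intercept, 3), round(slope, 4), round(s_y_original, 2))
```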


We have now the three values required for determining 
the relationship between Federal Reserve discount rates 
and corresponding commercial bank rates, during the 
period covered. The equation describes the average re- 
lationship, the standard error serves as a measure of 
the reliability of estimates based upon this equation, and 
the coefficient of correlation serves as an abstract measure 
of the degree of relationship between the two variables. 

The significance of the standard error, $S_y$, is brought out
graphically in Fig. 69. The line of average relationship has
been drawn on this scatter diagram, and what may be called
“zones of estimate” have been marked out about this line.
Within the zone having a width equal to $2S_y$, centering at the
fitted straight line, 68 per cent of all the points should fall,
on the assumption that the distribution is normal. Within
the zone having a width equal to $6S_y$, centering at the fitted
straight line, 99.7 per cent of all the points should fall, on
the same assumption. The smaller the value of $S_y$ the narrower
these zones would be, and hence the more accurate
would be the estimates which are based upon the equation
of average relationship.


Fig. 69. — Scatter Diagram of Federal Reserve and Commercial Bank Rates,
with Line of Average Relationship and Zones of Estimate
(Horizontal axis: Federal Reserve Bank Rates, per cent. Vertical axis:
Commercial Bank Rates, per cent.)


THE PRODUCT-MOMENT FORMULA FOR THE COEFFICIENT
OF CORRELATION


In the preceding examples the coefficient of correlation 
has been computed from the formula 

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

which is derived directly from the method of fitting a 
straight line by least squares. The usual formula differs 
somewhat from this, and it is advisable that the student be 
familiar with it. 

When a straight line is fitted to data, the origin being
at the point of averages, the two normal equations

$$\Sigma(Y) = Na + b\Sigma(X)$$
$$\Sigma(XY) = a\Sigma(X) + b\Sigma(X^2)$$

reduce to

$$\Sigma(xy) = b\Sigma(x^2)$$

for $\Sigma(x) = 0$ and $\Sigma(y) = 0$.

The slope, b, is the only constant required, and this may be
computed from the relationship

$$b = \frac{\Sigma(xy)}{\Sigma(x^2)}.$$

Under the same conditions the formula

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

reduces to

$$r^2 = \frac{b\Sigma(xy)}{\Sigma(y^2)}$$

for $c_y = 0$ when the deviations are measured from the mean
of the Y's. Substituting for b its equivalent, as just determined,
we have

$$r^2 = \frac{\Sigma(xy) \cdot \Sigma(xy)}{\Sigma(y^2) \cdot \Sigma(x^2)}.$$

But $\Sigma(y^2) = N\sigma_y^2$ and $\Sigma(x^2) = N\sigma_x^2$.

Therefore

$$r^2 = \frac{\Sigma(xy) \cdot \Sigma(xy)}{N^2\sigma_x^2\sigma_y^2}$$

and

$$r = \frac{\Sigma(xy)}{N\sigma_x\sigma_y}$$

in which x and y refer to deviations from an origin at the
point of averages.
This formula may be expressed

$$r = \frac{p}{\sigma_x\sigma_y}$$

in which

$$p = \frac{\Sigma(xy)}{N}.$$

The quantity p is the mean product of the paired values of
x and y.
The computation of the coefficient of correlation from 
this formula proceeds along lines somewhat different from 




those outlined above. As we have seen, both the arithmetic 
mean and the standard deviation may be readily computed 
by the selection of an arbitrary origin from which all devia- 
tions are measured, a later correction being made to offset 
the error involved in using this arbitrary origin. Similarly, 
the mean product p may be computed by a short method, 
requiring the use of assumed means and the application of 
a correction at the end of the process. 

If x’ and y’ represent deviations from points arbitrarily 
selected as assumed means, while p’ represents the mean 
product of such deviations, then 

$$p' = \frac{\Sigma(x'y')}{N}.$$

The computation of p' is not difficult, for deviations may be
measured from central points, and may be expressed in
class-interval units. Having p' we may secure the true
mean product from the formula

$$p = p' - c_xc_y$$

in which $c_x$ and $c_y$ represent the differences between the true
and assumed means of the x's and y's, respectively.¹

1 The following is a proof of this relationship:

x' = deviation of any point from assumed mean of x's
x = deviation of same point from true mean of x's
$c_x$ = difference between true and assumed means of x's
y' = deviation of same point from assumed mean of y's
y = deviation of same point from true mean of y's
$c_y$ = difference between true and assumed means of y's

$$x' = x + c_x$$
$$y' = y + c_y$$
$$x'y' = (x + c_x)(y + c_y) = xy + c_xy + c_yx + c_xc_y.$$

For the sum of all such products for N points, we have

$$\Sigma(x'y') = \Sigma(xy) + c_x\Sigma(y) + c_y\Sigma(x) + Nc_xc_y.$$

But $\Sigma(y) = 0$ and $\Sigma(x) = 0$.
Therefore

$$\Sigma(x'y') = \Sigma(xy) + Nc_xc_y$$

$$\frac{\Sigma(x'y')}{N} = \frac{\Sigma(xy)}{N} + c_xc_y$$

$$\frac{\Sigma(xy)}{N} = \frac{\Sigma(x'y')}{N} - c_xc_y$$

or

$$p = p' - c_xc_y.$$


THE PRODUCT-MOMENT METHOD, UNGROUPED DATA


This method may be illustrated with reference, first, to 
ungrouped data, using the figures for personal incomes (X) 
and motor vehicle registration (Y), by states. The values 
required for this computation, as given in Table 96, are 


$N = 43$
$\Sigma(X) = 3,602$
$\Sigma(Y) = 7,616$
$\Sigma(X^2) = 603,968$
$\Sigma(Y^2) = 2,612,066$
$\Sigma(XY) = 1,215,966$

The mean product can be computed from the formula

$$p = \frac{\Sigma(xy)}{N} = \frac{\Sigma(x'y')}{N} - c_xc_y.$$

We may select as arbitrary origin the actual origin on the
two original scales. Hence we have

$$p = \frac{\Sigma(XY)}{N} - c_xc_y.$$

For the two standard deviations

$$\sigma_x = \sqrt{\frac{\Sigma(X^2)}{N} - c_x^2}$$

$$\sigma_y = \sqrt{\frac{\Sigma(Y^2)}{N} - c_y^2}.$$

(When the arbitrary origin is at zero on the original scales, the
symbol X corresponds to x' and Y corresponds to y', as used in
the formulas.)

These measures may be computed readily from the values
secured from Table 96:

$$c_x = \frac{3,602}{43} = 83.767 \qquad c_y = \frac{7,616}{43} = 177.116$$

$$c_x^2 = 7,016.910 \qquad c_y^2 = 31,370.077$$

$$p = \frac{1,215,966}{43} - 14,836.476 = 13,441.803$$

$$\sigma_x = \sqrt{\frac{603,968}{43} - 7,016.910} = 83.84 \qquad \sigma_y = \sqrt{\frac{2,612,066}{43} - 31,370.077} = 171.41$$

Solving for the coefficient of correlation,

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{13,441.803}{83.84 \times 171.41} = +.935$$


The equation to the straight line which describes the 
average relationship between X and Y may be derived 
from the values required for the preceding calculations. 
When the origin is at the point of averages this equation 


may be written

$$y = r\frac{\sigma_y}{\sigma_x}x.$$

Substituting the proper values, we have

$$y = +.935 \times \frac{171.41}{83.84}x = 1.912x.$$

This, it will be noted, is the equation secured by the method
of least squares. The constant term representing the
y-intercept disappears, since the origin is at the point of
averages, through which the least squares line must pass.¹
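
The product-moment computation for ungrouped data may be sketched as below. The function name and the dictionary keys are illustrative assumptions; the six sums are those quoted from Table 96. The sketch carries $c_x$ and $c_y$ at full precision, so its intermediate figures differ in the later decimals from those printed above, while r and the slope agree to three places.

```python
import math


def product_moment_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product-moment r and the slope of Y on X from raw sums.

    The arbitrary origin is zero on both original scales, so c_x and c_y
    are simply the two arithmetic means.
    """
    c_x = sum_x / n
    c_y = sum_y / n
    p = sum_xy / n - c_x * c_y                   # mean product about the means
    sigma_x = math.sqrt(sum_x2 / n - c_x ** 2)
    sigma_y = math.sqrt(sum_y2 / n - c_y ** 2)
    r = p / (sigma_x * sigma_y)
    slope = r * sigma_y / sigma_x                # coefficient in y = r(sigma_y/sigma_x)x
    return {"p": p, "sigma_x": sigma_x, "sigma_y": sigma_y, "r": r, "slope": slope}


result = product_moment_from_sums(
    n=43, sum_x=3_602, sum_y=7_616,
    sum_x2=603_968, sum_y2=2_612_066, sum_xy=1_215_966,
)
print({key: round(value, 3) for key, value in result.items()})
```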


1 That the formula $y = r\frac{\sigma_y}{\sigma_x}x$ is equivalent to the formula based upon the
method of least squares may be readily demonstrated. When the line passes
through the point of averages, the equation, Y = a + bX, becomes y = bx. But

$$b = \frac{\Sigma(xy)}{\Sigma(x^2)}.$$

We may write, accordingly,

$$y_c = \frac{\Sigma(xy)}{\Sigma(x^2)}x.$$

This is equivalent to

$$y_c = r\frac{\sigma_y}{\sigma_x}x$$

for the latter may be written

$$(1)\quad y_c = \frac{\Sigma(xy)}{N\sigma_y\sigma_x} \cdot \frac{\sigma_y}{\sigma_x}x$$

$$(2)\quad y_c = \frac{\Sigma(xy)}{N\sigma_x^2}x$$

$$(3)\quad y_c = \frac{\Sigma(xy)}{N \cdot \frac{\Sigma(x^2)}{N}}x$$

$$(4)\quad y_c = \frac{\Sigma(xy)}{\Sigma(x^2)}x$$

(The symbol $y_c$ is employed for the computed value of y, in these equations, to
avoid confusion with the actual y's which appear in the right-hand members of the
equations.)



When the product-moment method is employed in com- 
puting the coefficient of correlation and in determining the 
equation of regression, the standard error, $S_y$, may be
derived by a simple change in the formula first presented
for r. From the expression

$$r = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

we may secure the formula

$$S_y = \sigma_y\sqrt{1 - r^2}$$

which enables us to compute $S_y$, if we have the values of
$\sigma_y$ and r. In the present case,

$$S_y = 171.41\sqrt{1 - (.935)^2} = 60.6$$
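
A brief sketch of this formula follows; the function name `standard_error_from_r` is an assumption. One point of arithmetic deserves notice: the printed result of 60.6 is reproduced when $r^2$ is taken as .875, the value found earlier, whereas squaring the rounded figure .935 gives a slightly larger result of about 60.8. The sketch shows both.

```python
import math


def standard_error_from_r(sigma_y: float, r: float) -> float:
    """S_y = sigma_y * sqrt(1 - r^2)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    return sigma_y * math.sqrt(1.0 - r * r)


# Using r^2 = .875, the unrounded value obtained earlier in the chapter.
s_y_from_r_squared = standard_error_from_r(171.41, math.sqrt(0.875))
# Using r rounded to three decimals before squaring.
s_y_from_rounded_r = standard_error_from_r(171.41, 0.935)

print(round(s_y_from_r_squared, 1), round(s_y_from_rounded_r, 1))
```

When r is close to unity the quantity $1 - r^2$ is small, so it is advisable to carry r to more decimal places than are finally reported.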


THE PRODUCT-MOMENT METHOD, CLASSIFIED DATA


The product-moment method is also applicable to cases 
in which it is necessary to construct a double frequency or 
correlation table. The procedure is shown in detail in 
Table 100. 

This table is identical with that previously presented for 
the same data, except that a different arbitrary origin has 
been selected. 

The value 5.5 is adopted as the assumed mean of the 
X’s, and the value 6.5 as the assumed mean of the Y’s. 
Deviations are measured in class-interval units from this 
origin. In each compartment of the correlation table 
there are three figures, involved in the computation of
$\Sigma(x'y')$. The figure in the center indicates the number of
items falling in that compartment. Thus there are seven
pairs having X values between 5.75 and 6.25 (mid-point
6.0) and Y values between 7.25 and 7.75 (mid-point 7.5).
For each of these pairs x' (the deviation from the assumed
mean of the X's) is + 1, in class-interval units, and y' (the
deviation from the assumed mean of the Y's) is + 2, in
class-interval units. For each pair, therefore, x'y' = + 2.


TABLE 100

Correlation Table Showing the Relation between Federal Reserve Bank
Discount Rates and the Discount Rates of Commercial Banks, and
Illustrating the Computation of the Coefficient of Correlation

(Columns: X, Federal Reserve Banks' Discount Rates, per cent. Rows: Y,
Discount Rates of Commercial Banks, per cent. The body of the table is
not legible in this copy.)



TABLE 100, Continued

$A.M._x = 5.5$ \qquad $A.M._y = 6.5$

$$c_x = \frac{\Sigma f(d')}{N} = \frac{-285}{504} = -.565 \qquad c_y = \frac{-359}{504} = -.712$$

$$c_x^2 = .319 \qquad c_y^2 = .507$$

$$\sigma_x^2 = \frac{\Sigma f(d')^2}{N} - c_x^2 = \frac{1753}{504} - .319 = 3.478 - .319 = 3.159 \qquad \sigma_x = \sqrt{3.159} = 1.777$$

$$\sigma_y^2 = \frac{1235}{504} - .507 = 2.450 - .507 = 1.943 \qquad \sigma_y = \sqrt{1.943} = 1.394$$

$$p = \frac{\Sigma(x'y')}{N} - c_xc_y = \frac{1231}{504} - .402 = 2.4425 - .402 = 2.0405$$

$$r = \frac{p}{\sigma_x\sigma_y} = \frac{2.0405}{1.777 \times 1.394} = \frac{2.0405}{2.477} = +.82$$

$$M_x = 5.218 \qquad M_y = 6.144$$

Note: The class-interval unit has been employed in all the computations shown
on this table.


This figure appears at the top of the compartment. But
there are seven pairs in this compartment, so the sum of
x'y' for this group is + 14. This figure appears in parentheses
at the bottom of the compartment. To secure
$\Sigma(x'y')$ it is necessary to add algebraically the values
secured in this way for all compartments. The addition
is first carried out for the different rows, the subtotals
being given in the column at the right of the table. It is
found that $\Sigma(x'y') = +1231$, in class-interval units.

The values of $c_x$ and $c_y$ have been secured in computing
the standard deviations by methods already familiar.
These values, also in class-interval units, are

$$c_x = -.565 \qquad c_y = -.712$$

We have, therefore

$$\frac{\Sigma(xy)}{N} = \frac{\Sigma(x'y')}{N} - c_xc_y = \frac{1231}{504} - .402 = 2.0405$$

This is the value of p, the mean product, in class-interval
units. Proceeding,

$$r = \frac{\Sigma(xy)}{N\sigma_x\sigma_y} = \frac{p}{\sigma_x\sigma_y} = \frac{2.0405}{1.777 \times 1.394} = +.82$$
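
The short method with assumed means may be sketched in Python. The cell frequencies are those of the correlation table for these data, written as (X', Y', f) with the earlier origin at 4, 4; shifting X' by 3 and Y' by 5 class intervals moves the origin to the assumed means 5.5 and 6.5. All variable names are illustrative assumptions.

```python
import math

# (X', Y', f) with origin (4, 4), in class-interval units of .5 per cent.
CELLS = [
    (0, 1, 2), (0, 2, 10), (0, 3, 11), (0, 4, 1),
    (1, 1, 1), (1, 2, 24), (1, 3, 110), (1, 4, 90), (1, 5, 2),
    (2, 3, 5), (2, 4, 29), (2, 5, 9), (2, 6, 5),
    (3, 4, 6), (3, 5, 10), (3, 6, 4),
    (4, 4, 30), (4, 5, 22), (4, 6, 63), (4, 7, 7), (4, 8, 1),
    (5, 5, 1), (5, 6, 9), (5, 7, 9), (5, 8, 1),
    (6, 5, 3), (6, 6, 36), (6, 7, 1), (6, 8, 2),
]
X_SHIFT = 3  # assumed mean 5.5 lies 3 class intervals above 4.0
Y_SHIFT = 5  # assumed mean 6.5 lies 5 class intervals above 4.0

# Deviations x', y' from the assumed means.
deviations = [(x - X_SHIFT, y - Y_SHIFT, f) for x, y, f in CELLS]

n = sum(f for _, _, f in deviations)
sum_dx = sum(f * dx for dx, _, f in deviations)
sum_dy = sum(f * dy for _, dy, f in deviations)
sum_dx2 = sum(f * dx * dx for dx, _, f in deviations)
sum_dy2 = sum(f * dy * dy for _, dy, f in deviations)
sum_dxdy = sum(f * dx * dy for dx, dy, f in deviations)

c_x = sum_dx / n                      # true mean minus assumed mean, X
c_y = sum_dy / n                      # true mean minus assumed mean, Y
p = sum_dxdy / n - c_x * c_y          # corrected mean product
sigma_x = math.sqrt(sum_dx2 / n - c_x ** 2)
sigma_y = math.sqrt(sum_dy2 / n - c_y ** 2)
r = p / (sigma_x * sigma_y)

print(sum_dxdy, round(c_x, 3), round(c_y, 3), round(p, 2), round(r, 2))
```

Since r is a ratio of quantities all expressed in class-interval units, no conversion to original units is needed before the division.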


In computing r, both the numerator and denominator of 
the final fraction (the mean product and the two standard 
deviations) are in class-interval units. Since this is true, 
r may be computed directly without reducing the figures 
to the original units. The entire operation, therefore, is 
carried on in simple class-interval units. 

In deriving the equation to the straight line which
describes the average relationship between x and y from
the formula

$$y = r\frac{\sigma_y}{\sigma_x}x$$

$\sigma_x$ and $\sigma_y$ should be expressed in units of the original scale.¹
This is done by multiplying the present values by the
class-intervals.

$$\sigma_x \text{ (in original units)} = 1.777 \times .50 = .8885$$

$$\sigma_y \text{ (in original units)} = 1.394 \times .50 = .697$$

Substituting the given values in the formula we have

$$y = .82 \times \frac{.697}{.8885}x$$

$$y = .642x$$


THE LINES OF REGRESSION


In the above discussion certain terms ordinarily employed 
in the treatment of correlation have been purposely omitted. 
Several of these should be explained. 


1 When the class-intervals happen to be the same, as in the present case, the 
change is not necessary, as the relation between numerator and denominator is 
not altered. In practice it is advisable always to express the two standard devia- 
tions in original units at this stage in the calculations. 




The equation to the line of best fit in the preceding 
illustration was found to be 


$$y = .642x$$


when the origin was taken at the point of averages. In 
this equation y is expressed as a function of x; that is, x is 
taken to be the independent variable and y the dependent 
variable. The equation expresses the average variation in 
y (discount rates of commercial banks) corresponding to a 
change of one unit in x (discount rates of Federal Reserve 
Banks). This line of relationship corresponds precisely to a 
line of trend, which describes the average change in a given 
series accompanying a unit change in time. A line which
thus describes the average relationship between two variables
is termed a line of regression. Its equation is termed a
regression equation, and the quantity $r\frac{\sigma_y}{\sigma_x}$ which gives the
slope of such a line is called a coefficient of regression. The
use of these terms dates back to early studies by Galton,
dealing with the relation between the height of fathers and
the height of sons. Sons, Galton found, deviated less on
the average from the mean height of the race than their
fathers. Whether the fathers were above or below the
average, the sons tended to go back or regress towards the
mean. He therefore termed the line which graphically
described the average relationship between these two
variables the line of regression. The term is now used
generally, as indicated above, though the original meaning
has no significance in most of its applications.

In any given case equations to two lines of regression may 
be computed. One is an expression of the average relation- 
ship between a dependent Y-variable and an independent 
X-variable; the other describes the relationship between a 
dependent X-variable and an independent Y-variable. The 
significance of the two may be indicated graphically. 

Figure 70 is derived directly from the scatter diagram 




presented in Fig. 69. The circle in each column represents 
the mean Y-value of all the items falling in that column. 
Thus in the first column there are 24 cases including all 
those with X-values falling between 3.75% and 4.25%. 


Fig. 70. — Showing the Relation between Discount Rates of Commercial Banks
and Federal Reserve Bank Discount Rates. (The broken line connects the
means of the columns and the straight line shows the average change in
commercial bank rates corresponding to a unit change in Federal Reserve
bank rates; i.e., it represents the regression of Y on X)
(Horizontal axis: Federal Reserve Bank Rates, per cent. Vertical axis:
Commercial Bank Rates, per cent.)



The Y-values vary, however, being distributed as shown in 
the following table. 


TABLE 101

Computation of the Arithmetic Mean of an Array

Class-interval    Mid-point (m)    Frequency (f)      fm
5.75 - 6.24           6.0                1             6.0
5.25 - 5.74           5.5               11            60.5
4.75 - 5.24           5.0               10            50.0
4.25 - 4.74           4.5                2             9.0

Totals                                  24           125.5

$$\text{Mean} = \frac{125.5}{24} = 5.23$$

Similar mean values are obtained for the other columns. 
These are plotted in Fig. 70, together with the line of regres- 
sion of Y on X. 
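
The mean of a single array is an ordinary weighted mean of the class mid-points, as the following sketch shows. The function name `array_mean` is an assumption; the mid-points and frequencies are those of the first column (Federal Reserve rate of 4 per cent).

```python
def array_mean(midpoints, frequencies):
    """Arithmetic mean of one column (or row) of a correlation table."""
    total = sum(frequencies)
    if total == 0:
        raise ValueError("the array contains no items")
    return sum(m * f for m, f in zip(midpoints, frequencies)) / total


# First column of the correlation table: X between 3.75 and 4.25 per cent.
y_midpoints = [6.0, 5.5, 5.0, 4.5]
y_frequencies = [1, 11, 10, 2]

mean_first_column = array_mean(y_midpoints, y_frequencies)
print(round(mean_first_column, 2))
```

Applying the same function to each column in turn gives the points joined by the broken line in the figure; applying it to the rows gives the means used for the other line of regression.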

In Figure 70 the X-variable (Federal Reserve bank dis- 
count rates) is independent. As it increases from 4.0% 
to 4.5, 5.0, 5.5%, and so on, the average of commercial 
bank rates increases also. An average commercial bank 
rate of 5.23% was associated with an average Federal 
Reserve bank rate of 4%; an average commercial bank 
rate of 5.65% was associated with an average Federal 
Reserve bank rate of 4.5%, and so on. The slope of the
straight line, which is the line of regression or the line of 
average relationship, measures the average increase in 
commercial bank rates corresponding to a unit increase 
in Federal Reserve bank rates. 

It is possible to view the relationship between these two 
variables in another light. These questions arise: Given 
a certain commercial bank discount rate, what is the average 
Federal Reserve bank rate associated with it? And for a 
given change in commercial bank discount rates, what is the 
average change in the corresponding Federal Reserve bank 
rates? The commercial bank rate is now looked upon as 
independent, and the Federal Reserve rate as an associated 
dependent variable. These questions are answered by 
Fig. 71. The points marked by the small circles and 
connected by the broken line show the locations of the
arithmetic means of the items falling in the various rows. 
Thus the three X-items in the first row have an average 
value of 4.17%. This is the average Federal Reserve bank 
discount rate associated with a commercial bank rate of 
4.5%. The average Federal Reserve bank rate associated 
with a commercial bank rate of 5.0% is 4.35%, and so on. 
The straight line fitted to these points indicates the rela- 
tionship between the two, its slope measuring the average 




increase (or decrease) in Federal Reserve bank rates 
associated with a unit change in commercial bank rates. 

This is the line of regression of X on Y. The general
formula for the equation to this line is:

$$x = r\frac{\sigma_x}{\sigma_y}y$$

Fig. 71. — Showing the Relation between Federal Reserve Bank Discount Rates
and the Discount Rates of Commercial Banks. (The broken line connects
the means of the rows and the straight line shows the average change in
Federal Reserve bank rates corresponding to a unit change in commercial
bank rates; i.e., it represents the regression of X on Y)
(Horizontal axis: Federal Reserve Bank Rates, per cent. Vertical axis:
Commercial Bank Rates, per cent.)

Substituting the present values, we have

$$x = .82 \times \frac{.8885}{.697}y$$

or

$$x = 1.045y$$



The factors in this equation, it will be seen, are the same as 
those entering into the formula for the line of regression of 
Y on X’. If r is equal to 1 the two lines coincide, and if, 
in addition, the two standard deviations are equal, the line 
of regression will bisect the angle formed by the axes. In 
any case, if the points be plotted on a chart scaled in units 
of the standard deviations, we have y = rx and the slope
of the line of regression is thus equal to the value of r. 

The coefficient of regression is represented by the symbol 
b. In a simple correlation problem there are two such 
coefficients, representing the slopes of the two lines of re- 
gression. These are 


$$b_{yx} = r \frac{\sigma_y}{\sigma_x}$$

$$b_{xy} = r \frac{\sigma_x}{\sigma_y}$$


(The subscripts indicate the relation between the two vari- 
ables. The first subscript refers to the dependent variable in each 
case.) 


The coefficient 7 appears in both formulas. This being 
so, it is clear that r may be computed from the regression 
coefficients. For 


$$r = \sqrt{b_{yx} \cdot b_{xy}} = \sqrt{r \frac{\sigma_y}{\sigma_x} \cdot r \frac{\sigma_x}{\sigma_y}} = \sqrt{r^2}$$


¹ The formula

$$x = r \frac{\sigma_x}{\sigma_y} y$$

may be reduced to

$$x = \frac{\Sigma(xy)}{\Sigma(y^2)} y$$

This is the equation to a line fitted to the points plotted in Fig. 71 in such a way
that the sum of the squares of the horizontal deviations is a minimum.
The formula

$$y = \frac{\Sigma(xy)}{\Sigma(x^2)} x$$

is the equation to the line for which the sum of the squares of the vertical deviations
is a minimum. An understanding of this point may make clear the difference
between the two lines of regression.




Thus if we know the slopes of the two lines of regression r 
may be determined. In the present example, using the 
exact values obtained, 


$$r = \sqrt{.64248 \times 1.045} = .819$$
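
The relation between r and the two coefficients of regression may be checked numerically. The following is a minimal sketch in Python using the two slopes quoted above; the function name `r_from_slopes` is an illustrative assumption and does not appear in the original treatment.

```python
import math

def r_from_slopes(b_yx, b_xy):
    """Return r as the geometric mean of the two regression slopes.

    The two slopes always share the sign of r, so the sign of b_yx
    is carried over to the result.
    """
    product = b_yx * b_xy
    if product < 0:
        raise ValueError("the two regression slopes must have the same sign")
    return math.copysign(math.sqrt(product), b_yx)

# Slopes quoted in the text for the bank-rate example.
r = r_from_slopes(0.64248, 1.045)
print(round(r, 3))  # 0.819
```

Carrying the sign over from the slope is a design choice of this sketch: the square root alone cannot distinguish positive from negative correlation.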


USE OF THE EQUATIONS OF REGRESSION


The two equations of regression given above 


y = .64x


and 
x = 1.045y 


describe relations between deviations from the respective 
arithmetic means. That is, the origin is at the point of 
averages, and to use the equations we cannot use the 
original values of X and Y but must express them as devia- 
tions from their means. For example, we wish to determine 
the normal commercial bank rate associated with a Federal 
Reserve bank rate of 6%. The mean value of the X-variable 
(Federal Reserve bank rates) is 5.218%. A rate of 6% 
represents a deviation from the mean of +.782. Substi- 
tuting this value in the first of the above equations, we have 


y = .64( +.782) 
= +.500 


This is the average y-deviation associated with an x-devia- 
tion of +.782. To get the normal commercial bank rate 
associated with a Federal Reserve rate of 6% the value 
+ .500% must be added to the mean commercial bank 
rate, 6.144%. The value we wish is thus 6.644%. 
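
The round-about calculation just described can be sketched in a few lines of Python. The means and the slope are those quoted in the text; the names `MEAN_X`, `MEAN_Y`, `B_YX` and `estimate_y` are assumptions introduced only for this illustration.

```python
# Values quoted in the text for the bank-rate example.
MEAN_X = 5.218   # mean Federal Reserve bank rate, percent
MEAN_Y = 6.144   # mean commercial bank rate, percent
B_YX = 0.64      # slope of the regression of Y on X, rounded as in the text

def estimate_y(x_value):
    """Estimate Y from an original-scale X by way of deviations from the means."""
    x_dev = x_value - MEAN_X   # deviation of X from its mean
    y_dev = B_YX * x_dev       # average associated deviation of Y
    return MEAN_Y + y_dev      # return to the original scale

print(round(estimate_y(6.0), 3))  # 6.644
```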

This calculation has been rather round-about because of 
the form of the equation of relationship. This equation can 
be put in more appropriate form for such computations. 

Let

$\bar{X}$ = arithmetic mean of the X's
$\bar{Y}$ = arithmetic mean of the Y's.

Then

$$y = r \frac{\sigma_y}{\sigma_x} x$$

may be written

$$Y - \bar{Y} = r \frac{\sigma_y}{\sigma_x} (X - \bar{X})$$
In this last equation X and Y represent the values of the 
variables on the original scales, and not as deviations from 
their respective means. In terms of the coordinate chart, it
means shifting the origin from the point of averages to a 
point corresponding to zero on each of the original scales. 

To illustrate the greater utility of the equation in this 
form, the equation 

y = .64x

may be changed in the method indicated. It becomes 


Y - 6.144 = .64(X - 5.218)
Y - 6.144 = .64X - 3.340
Y = .64X + 2.804


This is the equation with the origin so shifted that the 
original values may be employed directly. To determine 
the commercial bank rate normally associated with a 
Federal Reserve rate of 6% we may substitute the latter 
value in the equation just secured. 


Y = (.64 x 6.0) + 2.804 
= 6.644 


Precisely the same results are secured as with the equation
in the other form, but for many purposes it is preferable
to have an equation in which the actual values may be 
inserted. 
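
The shift of origin may be sketched as follows. This is a minimal illustration in Python, assuming the slope and means quoted in the text; the helper name `to_intercept_form` is hypothetical.

```python
def to_intercept_form(slope, mean_x, mean_y):
    """Rewrite y = slope * x (deviations from the means) as Y = slope * X + intercept."""
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Slope and means quoted in the text for the bank-rate example.
slope, intercept = to_intercept_form(0.64, 5.218, 6.144)
estimate = slope * 6.0 + intercept

print(round(intercept, 3))  # 2.804
print(round(estimate, 3))   # 6.644
```

The estimate agrees with the one secured from the equation in deviation form, as it must, since only the origin has been moved.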

The equation

$$x = r \frac{\sigma_x}{\sigma_y} y$$

may be similarly changed to

$$X - \bar{X} = r \frac{\sigma_x}{\sigma_y} (Y - \bar{Y})$$


SUMMARY 


In the foregoing pages there have been presented two 
quite different methods of securing the values required in 
measuring the relationship between two variables. The 
steps in the two methods may be briefly summarized. The 
method of least squares is basic in both cases, but that term 
may appropriately be employed to describe the first method 
outlined, for the process of fitting the line is the first and 
fundamental step in that procedure. 


I. The Least Squares Method. 


A. Data to be handled as individual items. 


1. Fit a straight line to the data by the method of least
squares. A simple arrangement of the data in columns
will permit the ready computation of the required
values, $\Sigma(X)$, $\Sigma(Y)$, $\Sigma(X^2)$, $\Sigma(Y^2)$, $\Sigma(XY)$. The equa-
tion thus obtained describes the average relationship
between the two variables.

2. Compute the standard error, $S_y$, from the formula

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY)}{N}$$

$S_y$ is a measure of the reliability of estimates based
upon the equation of relationship, and is to be inter-
preted in the same way as is the standard deviation
about an arithmetic mean.

3. Compute the coefficient of correlation, r, from the
formula

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - N c_y^2}{\Sigma(Y^2) - N c_y^2}$$

(where $c_y = \Sigma(Y)/N$). Give r the sign of the constant b
in the equation of regression. This coefficient is an
abstract measure of the degree of relationship between
the two variables, in so far as this relationship may be
described by a straight line.

4. If an equation describing the regression of X on Y
(X being dependent) is desired, the proper values may
be substituted in the two normal equations

$$\Sigma(X) = Na + b\Sigma(Y)$$
$$\Sigma(XY) = a\Sigma(Y) + b\Sigma(Y^2)$$

The equation secured will be of the type

$$X = a + bY$$

The standard error, $S_x$, may be computed by making
the appropriate changes in the formula as given for
$S_y$. The value of r will be the same as in the pre-
ceding case, in which Y is dependent.


B. Data to be classified. 


1. Select an appropriate class-interval and tabulate the 
items in the form of a correlation table. 

2. Compute the necessary values for fitting a straight line
to the data. In doing so, an arbitrary origin may be
selected for each variable, and all values expressed in
class-interval units. A re-arrangement in columnar
form may facilitate the computation of the quantity
$\Sigma(XY)$.

3. Compute the standard error, employing the formula 
given above. 

4. Compute the coefficient of correlation from the formula 
given above. 

5. If the above calculations were carried on in class- 
interval units the equation of average relationship and 
the standard error should now be expressed in terms 
of the original units of measurement. If an arbitrary 
origin was employed the equation should be corrected 
so that the variables relate to deviations from the 
true origin. 
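
The steps of the least squares method for individual items (I. A. above) may be sketched in Python as follows. The five paired observations are hypothetical, chosen only so that the arithmetic is easy to follow, and the function name `least_squares_summary` is an assumption of this sketch.

```python
import math

def least_squares_summary(xs, ys):
    """Fit Y = a + bX by least squares and return (a, b, S_y, r)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Solve the two normal equations
    #   sum_y  = n*a     + b*sum_x
    #   sum_xy = a*sum_x + b*sum_x2
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n

    # Standard error of estimate about the fitted line.
    s_y = math.sqrt(max((sum_y2 - a * sum_y - b * sum_xy) / n, 0.0))

    # Coefficient of correlation, given the sign of b.
    c_y = sum_y / n
    r_squared = (a * sum_y + b * sum_xy - n * c_y ** 2) / (sum_y2 - n * c_y ** 2)
    r = math.copysign(math.sqrt(max(r_squared, 0.0)), b)
    return a, b, s_y, r

# Hypothetical paired observations, not taken from the text.
a, b, s_y, r = least_squares_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(a, 3), round(b, 3), round(s_y, 3), round(r, 3))
```

For these figures the fitted line is Y = 2.2 + .6X. The `max(..., 0.0)` guards are a precaution against small negative values arising from rounding in floating-point arithmetic.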


II. The Product-Moment Method. 


A. Data to be handled as individual items. 


1. Arrange the paired observations in parallel columns
and compute the quantities $\Sigma(X)$, $\Sigma(Y)$, $\Sigma(X^2)$,
$\Sigma(Y^2)$, $\Sigma(XY)$.

2. Divide these quantities throughout by N. For the
first two of these quotients we may use the symbols
$c_x$ and $c_y$, i.e.

$$\frac{\Sigma(X)}{N} = c_x$$

and

$$\frac{\Sigma(Y)}{N} = c_y$$

3. Compute the mean product from the formula

$$p = \frac{\Sigma(XY)}{N} - c_x c_y$$

4. Compute the two standard deviations from the
formulas

$$\sigma_x = \sqrt{\frac{\Sigma(X^2)}{N} - c_x^2}$$
$$\sigma_y = \sqrt{\frac{\Sigma(Y^2)}{N} - c_y^2}$$

5. Compute the coefficient of correlation from the
formula

$$r = \frac{p}{\sigma_x \sigma_y}$$

6. Determine the equations of regression by substituting
the proper values in the formulas

$$y = r \frac{\sigma_y}{\sigma_x} x$$
$$x = r \frac{\sigma_x}{\sigma_y} y$$

(Note: For each of these equations the origin is at the
point of averages.)

7. If desired, transfer the origin to zero on the two
original scales by substituting the arithmetic means
in the equations

$$Y - \bar{Y} = r \frac{\sigma_y}{\sigma_x} (X - \bar{X})$$
$$X - \bar{X} = r \frac{\sigma_x}{\sigma_y} (Y - \bar{Y})$$

8. Compute the two standard errors from the formulas

$$S_y = \sigma_y \sqrt{1 - r^2}$$
$$S_x = \sigma_x \sqrt{1 - r^2}$$

B. Data to be classified. 


1. Construct a correlation table as in I. B. above.

2. Select an assumed mean for each variable. Measure
the deviations of the various items from the assumed
means in class-interval units.

3. Compute $c_x$ and $c_y$ in class-interval units.

4. Compute $\sigma_x$ and $\sigma_y$ in class-interval units.

5. Compute $\Sigma(x'y')$ in class-interval units for each
compartment of the correlation table. Total these
figures to get $\Sigma(x'y')$ for the whole table.

6. Determine the value of the mean product in class-
interval units from the formula

$$p = \frac{\Sigma(x'y')}{N} - c_x c_y$$

7. Compute r from the formula

$$r = \frac{p}{\sigma_x \sigma_y}$$

8. Reduce $\sigma_x$ and $\sigma_y$ to original units.

9. Determine the equations of regression by substi-
tuting the proper values in the formulas

$$y = r \frac{\sigma_y}{\sigma_x} x$$

and

$$x = r \frac{\sigma_x}{\sigma_y} y$$

10. If desired, transfer the origin to zero on the two
original scales from the formulas

$$Y - \bar{Y} = r \frac{\sigma_y}{\sigma_x} (X - \bar{X})$$
$$X - \bar{X} = r \frac{\sigma_x}{\sigma_y} (Y - \bar{Y})$$

11. Compute the two standard errors from the formulas

$$S_y = \sigma_y \sqrt{1 - r^2}$$
$$S_x = \sigma_x \sqrt{1 - r^2}$$

It is advisable, in all cases, to construct scatter diagrams
and to plot the lines of regression thereon. It is generally
possible to derive from such diagrams a truer idea of the
relations involved, and of the adequacy of the methods
employed, than may be obtained from a study of the figures
alone.

LIMITATIONS 


A question naturally arises as to the degree of generality 
attaching to the measures of relationship described in the 
preceding pages. Are they limited to certain types of dis- 
tributions, or may they be employed as absolutely general 
and universally valid measures? 

As we have seen, the standard deviation has a precise and 
definite meaning with respect to distributions following 
the normal law. Having values of the mean and of the 
standard deviation, we know, in such cases, the exact 
percentage of observations which will fall within any 
stated limits. If the distribution departs from the normal 
type the standard deviation is still a useful measure, but 
it cannot be interpreted in the same exact sense. Bearing 
this in mind, the formula

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

may be considered.

When the distribution of the original values of the 
dependent variable about their mean is normal and the 
distribution about the least squares line is normal, both 
$S_y$ and $\sigma_y$ have specific and exact meanings, and it is per-
fectly legitimate to compute such a measure as r, based
upon the relation of one to the other. Departures from 
normality in either case reduce the significance of this 
comparison. But we have seen that the standard deviation 
remains a useful measure even though the departure from 
the normal type be fairly pronounced, though in the latter 
case, it lacks the precise significance attaching to it in a nor- 
mal distribution. In the same way the standard error and 
the coefficient of correlation may be computed and utilized, 
even when all the requirements of normality are not met. 




Care must be taken in their interpretation in such cases, 
however. It must be clearly recognized that these measures 
have their full significance only in cases where the original 
distribution of the dependent variable and the distribution 
about the least squares line are both normal, or approx- 
imately so. 

A simple example may make clear the effect upon the 
value of the coefficient of correlation of an extreme departure 
from a normal distribution. In the following table are 
listed certain selected figures taken from the 1919 Census 
of Manufactures, for the State of New York. 


TABLE 102

Wage Earners in Factories and Value of Products, 1919, in Eleven Cities
in the State of New York

                   Number of Wage    Total Value of
City               Earners (in       Products (in
                   thousands)        millions of dollars)
                   (X)               (Y)

Batavia            2.2               9
Beacon             [illegible]       10
Corning            3.5               11
Geneva             [illegible]       10
Glens Falls        2.8               12
Ithaca             [illegible]       10
Middletown         [illegible]       10
Peekskill          2.1               11
Rensselaer         1.4               10
Tonawanda          1.8               16
New York City      638.8             5,261


When the first ten of these cities, in the order listed, are 
treated as a group, the following values are secured: 


$\sigma_y$ = 1.8682
$S_y$ = 1.8669
r = -.034


The ten points and the line of regression are plotted in
Fig. 72. (No general significance is to be attached to the
above coefficient of correlation, for the cities were selected
for the purpose of illustrating a particular point.)


Fig. 72. Showing the Relation between Number of Wage Earners in Factories
and Value of Products in Ten Selected Cities in the State of New York.
(Horizontal axis: thousands of wage earners. Vertical axis: value of
products, millions of dollars.)


When New York City is included in the group, the values 
secured for the sample of eleven cities are 


$\sigma_y$ = 1509.3
$S_y$ = 7.53
r = +.999988


The eleven points and the line of regression are plotted in
Fig. 73.

Fig. 73. Showing the Relation between Number of Wage Earners in Factories
and Value of Products in Eleven Selected Cities in the State of New York.
(Horizontal axis: thousands of wage earners. An insert on the chart shows
the lower left hand corner magnified 80 times.)

The reason for the markedly different results is obvious.
When the one very large city is included with the ten small
cities the standard deviations of both variables are greatly
increased. That of the Y-variable (value of products) is
increased from 1.8682 to 1509.3. But $S_y$, the measure of
the scatter about the fitted line, undergoes no such pro-
nounced change in value. For the ten cities it is 1.8669;
for the eleven cities 7.53. This is due to the fact that the
one exceptional case is given such a great weight, in fitting
by the method of least squares, that the fitted line must
pass through or very near the point representing this
observation. Accordingly, S is always affected less than
$\sigma$ by a single very exceptional case. Since the value of r
depends upon the relationship

$$r^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

the presence of such a case always tends to increase the
value of the measure of correlation. The introduction of
the one exceptional case in the above example changes a
correlation coefficient of virtually zero to one of unity.
The result, of course, is meaningless.
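
The effect of a single exceptional case can be reproduced with a very small set of figures. The sketch below, in Python, uses hypothetical points and not the census figures of Table 102; the function name `pearson_r` is likewise an assumption of this illustration.

```python
import math

def pearson_r(xs, ys):
    """Coefficient of correlation computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Four hypothetical small cases showing no linear relationship at all.
small_x = [1.0, 2.0, 3.0, 4.0]
small_y = [2.0, 1.0, 1.0, 2.0]
r_small = pearson_r(small_x, small_y)

# The same cases with one very exceptional observation added.
r_with_outlier = pearson_r(small_x + [100.0], small_y + [100.0])

print(r_small, round(r_with_outlier, 4))
```

The four small cases give a coefficient of zero. With the exceptional case added the coefficient rises above .99, although nothing has been learned about the relationship among the small cases.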

While this example represents an extreme instance, the
same distortion will be felt, to a greater or less degree,
whenever there is a departure from a normal distribution.
In practice the various measures of relationship cannot
be restricted to perfectly normal distributions, but they
must be interpreted with care when there is reason to believe
that such disturbing influences are present.




CHAPTER XI 


THE MEASUREMENT OF RELATIONSHIP BETWEEN 
TIME SERIES 


The methods of measuring correlation which have been 
described in the preceding chapter were devised originally 
for the analysis of non-historical data, that is, for the 
treatment of frequency series rather than time series. The
measurement of correlation between series in time presents 
certain distinctive problems which require separate treat- 
ment. 

We have seen that such series are affected by various 
forces, which have been classified as the secular trend, 
cyclical and seasonal fluctuations and accidental variations, 
and methods have been described by means of which the 
effects of these various forces may be isolated. This 
breaking up of a series into its component parts for separate 
study is essential in attempting to correlate series in time, 
for spurious and quite misleading results will be secured 
if this is not done. The problem of correlation is that of 
securing a precise measure of the degree of relationship 
between variable quantities. But each series in time 
represents the combination of a number of variables and, 
so far as possible, each should be treated separately in 
correlating such series. 

The relationship between two time series as, for example, 
interest rates and bond prices, may be studied with respect 
to any or all of the following components: 

a. Secular trend. 

b. Cyclical fluctuations. 

c. Seasonal fluctuations. 

d. Changes from one time unit to the next (e.g., week to week,
month to month, or year to year).


Such relationships may be studied, first, through the com- 
parison of graphs, and much may be learned by this simple 
process. The similarity or dissimilarity of secular trends, 
and the general relation between cyclical movements may 
be determined by a study of such graphs. For more 
accurate comparison the coefficient of correlation may be 
used, but when it is so employed it is particularly im- 
portant that the precise nature of its employment and the 
exact significance of the results be understood. 

For the comparison of secular trends the coefficient of 
correlation would never be employed. The mere fact 
that two series have the same secular trend is no indication 
of a relationship of interdependence; a coefficient of 
correlation based upon the trend values would be meaning- 
less. Moreover, much simpler methods are available for 
comparing trends. 

For the same reason a coefficient of correlation should 
not be based upon the original absolute values of two 
series in time, except in the rather rare case in which neither 
series is marked by a definite secular trend. The computa- 
tion of r, when dealing with ordinary statistical data, in- 
volves measuring the deviations of all the items from their 
respective arithmetic means, and securing the sum of the 
products of the paired deviations. When deviations of like 
sign are paired throughout r will have a positive value; when 
deviations of unlike signs are paired throughout r will have 
a high negative value. The presence of a pronounced up- 
ward or downward secular trend makes it impossible to 
secure significant values for r by the employment of this 
method. For example, the relationship between two series, 
as automobile production and the price of bacon between 
the years 1900 and 1920, might be measured. The secular
trend is markedly upward in both cases. When the devi- 
ations of the annual figures are measured from the arith- 
metic means of the two series, the paired items for the 
earlier years will be negative, for the later years positive.
A fairly high positive value for r would be secured, were
the computation carried through on this basis. This value 
would be quite misleading, for no real relationship can be 
expected in this case. The coefficient of correlation in 
such a case would measure, primarily, the relation between 
the two secular trends. 
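
The point may be illustrated with two hypothetical series, each made up of a rising straight-line trend and a small fluctuation, the two fluctuations having no connection with each other. The sketch below is in Python; the series, the helper names `pearson_r` and `linear_trend_residuals`, and the choice of a straight-line trend are all assumptions made for the illustration.

```python
import math

def pearson_r(xs, ys):
    """Coefficient of correlation computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def linear_trend_residuals(values):
    """Deviations from a least squares straight line fitted against time 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    slope = (sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return [v - (v_mean + slope * (t - t_mean)) for t, v in enumerate(values)]

# Hypothetical fluctuations with no relation to one another.
wiggle_a = [1, -1, 1, -1, 1, -1, 1, -1]
wiggle_b = [1, 1, -1, -1, 1, 1, -1, -1]

# Each series is an upward trend plus its own fluctuation.
series_a = [10 + 3 * t + w for t, w in enumerate(wiggle_a)]
series_b = [20 + 5 * t + w for t, w in enumerate(wiggle_b)]

r_raw = pearson_r(series_a, series_b)
r_detrended = pearson_r(linear_trend_residuals(series_a),
                        linear_trend_residuals(series_b))
print(round(r_raw, 3), round(r_detrended, 3))
```

Correlating the original values gives a coefficient close to unity, which reflects only the common upward trend. When deviations from the fitted trends are correlated instead, the coefficient falls to a small value.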

This coefficient might conceivably be employed to de- 
termine the similarity between seasonal fluctuations in 
two series, but its utility for this purpose may be questioned. 
Here again other and simpler methods are available. 

In practice, therefore, the device of correlation should 
be employed neither to measure the relation between 
secular trends nor between seasonal movements. Its use 
is confined to comparisons of two or more series with 
respect to cyclical fluctuations and with respect to the 
short time changes from month to month or year to year. 
And, if valid measures of correlation are to be secured in 
making such comparisons, the effects of other forces which 
are not being studied must be eliminated, in so far as this 
is possible. The actual work of correlation must be pre- 
ceded by a sifting process designed to remove all such 
irrelevant material. Unless the data are thus "distilled"
the interpretation of the resulting coefficients will be 
difficult. 


THE MEASUREMENT OF CORRELATION BETWEEN
CYCLICAL FLUCTUATIONS


In an earlier chapter we have dealt with methods by 
which the effects of certain of the factors affecting time 
series might be measured and eliminated. The spurious 
correlation due to secular trend may be avoided by measur- 
ing the deviations of the observations not from the respec- 
tive arithmetic averages but from the lines of secular trend 
of the two series. These variations, the deviations from 
normal, are the significant values if our interest centers
in the cycles. If annual values are employed the problem
of eliminating seasonal fluctuations is not faced. 

To illustrate this method of measuring the relationship 
between series in time we may undertake to determine 
whether there is any connection between cyclical fluctua- 
tions in cotton production and in cotton prices. Figures 
for crop years are to be employed, for the period 1900-01 
to 1922-23.

Cotton prices require some correction before correlation
is attempted. The raw figures with which the investigation
starts are average spot prices at New York for middling
upland cotton, at wholesale, from September to May of
each crop year. But such prices reflect not only the effects
of varying conditions in the cotton market, but also changes
in the general level of prices. To eliminate the effect
of this factor the original prices are deflated by Bradstreet's
price index, as computed for the September-May period
in each crop year. For this purpose Bradstreet's index
has been reduced to relative terms, with the average for
the crop year 1913-14 equal to 100. The original figures
for the two series to be correlated, together with the
corrected price figures, are given in Table 103.

Fig. 74. Cotton Production in the United States, Crop Years 1900-01 to
1922-23, with Lines of Trend


TABLE 103

Cotton Production and Cotton Prices, 1900-1923

(1)         (2)                  (3)                     (4)                (5)
Crop Year   Cotton Production    Cotton Prices.          Bradstreet's       Cotton Prices,
            in United States,    Average of spot         Price Index,       deflated
            excluding linters    prices at N. Y.,        Average, Sept.     (in cents per
            (in thousands        middling upland         to May             pound)
            of bales)            cotton, Sept. to May    (1913-14 = 100)
                                 (in cents per pound)

1900-01     10,123                9.58                    84.8              11.36
1901-02      9,510                8.64                    86.2              10.02
1902-03     10,631                9.50                    90.0              10.56
1903-04      9,851               13.20                    88.6              14.90
1904-05     13,438                8.69                    89.3               9.73
1905-06     10,575               11.40                    92.3              12.35
1906-07     13,274               10.97                    98.8              11.10
1907-08     11,107               11.41                    93.2              12.24
1908-09     13,242                9.81                    91.3              10.74
1909-10     10,005               14.62                   100.6              14.53
1910-11     11,609               14.80                    97.8              15.13
1911-12     15,693               10.34                   100.0              10.34
1912-13     13,703               12.35                   104.8              11.78
1913-14     14,156               13.40                   100.0              13.40
1914-15     16,135                8.63                   105.2               8.20
1915-16     11,192               12.04                   121.2               9.93
1916-17     11,450               18.29                   151.0              12.11
1917-18     11,302               29.96                   197.9              15.14
1918-19     12,041               30.06                   203.1              14.80
1919-20     11,421               38.63                   226.3              17.07
1920-21     13,440               16.90                   152.9              11.05
1921-22      7,954               18.67                   127.2              14.68
1922-23      9,762               26.26                   149.7              17.54
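
The deflation of the price series may be sketched in Python as follows, using four rows of Table 103. The function name `deflate` and the variable names are assumptions made for this illustration.

```python
def deflate(price, index, base=100.0):
    """Express a price in terms of the base-period price level."""
    return price / index * base

# (crop year, spot price in cents per pound, Bradstreet's index with 1913-14 = 100)
rows = [
    ("1901-02", 8.64, 86.2),
    ("1913-14", 13.40, 100.0),
    ("1919-20", 38.63, 226.3),
    ("1922-23", 26.26, 149.7),
]

deflated = {year: round(deflate(price, index), 2) for year, price, index in rows}
print(deflated)
# {'1901-02': 10.02, '1913-14': 13.4, '1919-20': 17.07, '1922-23': 17.54}
```

The results agree with column (5) of the table. In the base year the index is 100, so the deflated price equals the original price.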




These data are plotted in Figures 74 and 75. Two lines 
of trend which were fitted to each series are shown on 
each chart. 


Fig. 75. Prices of Middling Upland Cotton in New York, Crop Years 1900-01
to 1922-23, with Lines of Trend. (Figures relate to average annual prices,
during the crop years, deflated by Bradstreet's index of wholesale prices.
Vertical axis: deflated price.)


The deviation of each annual item from the secular trend 
of the given series is now to be measured, and the coeffi- 
cient of correlation between these deviations is to be 
calculated. Deviations from the two third degree parab-
olas are first correlated.¹ The computations appear in
Table 104.


¹ It is not, of course, essential that the trend lines be of the same type for the
two series being correlated.




TABLE 104

Computation of the Coefficient of Correlation, Cotton Production and
Cotton Prices

(1)        (2)                  (3)                    (4)           (5)        (6)
Year       Deviation of         Deviation of
           Cotton Production    deflated Cotton
           from parabolic       Price from parabolic
           line of trend        line of trend
           (in thousands        (in cents per
           of bales)            pound)
           x                    y                      x²            y²         xy

1900-01    +371                 +.79                      137,641      .6241       +293.09
1901-02    -603                 -1.05                     363,609     1.1025       +633.15
1902-03    +145                 -.93                       21,025      .8649       -134.85
1903-04    -1,012               +3.11                   1,024,144     9.6721     -3,147.32
1904-05    +2,200               -2.26                   4,840,000     5.1076     -4,972.00
1905-06    -1,027               +.24                    1,054,729      .0576       -246.48
1906-07    +1,325               -1.06                   1,755,625     1.1236     -1,404.50
1907-08    -1,164               +.08                    1,354,896      .0064        -93.12
1908-09    +681                 -1.38                     463,761     1.9044       -939.78
1909-10    -2,806               +2.47                   7,873,636     6.1009     -6,930.82
1910-11    -1,405               +3.14                   1,974,025     9.8596     -4,411.70
1911-12    +2,530               -1.60                   6,400,900     2.5600     -4,048.00
1912-13    +452                 -.13                      204,304      .0169        -58.76
1913-14    +887                 +1.47                     786,769     2.1609     +1,303.89
1914-15    +2,923               -3.81                   8,543,929    14.5161    -11,136.63
1915-16    -1,878               -2.24                   3,526,884     5.0176     +4,206.72
1916-17    -1,388               -.30                    1,926,544      .0900       +416.40
1917-18    -1,206               +2.37                   1,454,436     5.6169     -2,858.22
1918-19    -31                  +1.56                         961     2.4336        -48.36
1919-20    -102                 +3.21                      10,404    10.3041       -327.42
1920-21    +2,586               -3.59                   6,687,396    12.8881     -9,283.74
1921-22    -2,104               -.91                    4,426,816      .8281     +1,914.64
1922-23    +636                 +.82                      404,496      .6724       +521.52

Totals     0                    0                      55,236,930    93.5284    -40,752.29

$$\sigma_x = \sqrt{\frac{55{,}236{,}930}{23}} = 1{,}550$$

$$\sigma_y = \sqrt{\frac{93.5284}{23}} = 2.017$$

$$r = \frac{\Sigma(xy)}{N \sigma_x \sigma_y} = \frac{-40{,}752.29}{23 \times 1{,}550 \times 2.017} = -.567$$




This value of — .567 for the coefficient indicates a fair 
degree of negative correlation between deviations of cotton 
production in the United States from the line of trend 
and the corresponding deviations of cotton prices in New 
York, during the period covered. A somewhat higher 
value could undoubtedly have been secured had the rather 
abnormal war years been omitted, but there are some 
objections to such an omission. 

From the values already computed we may derive an 
equation for estimating the variation in cotton price 
associated with a given variation in production. This 
regression equation, as we have seen, is of the type

$$y = r \frac{\sigma_y}{\sigma_x} x$$

In the present case y and x refer to deviations from the
parabolic lines of trend. Substituting the given values we
have

$$y = -.567 \times \frac{2.017}{1{,}550} x$$

y = -.0007378x

This equation means that, on the average, a unit devia-
tion of cotton production (x) above the line of trend was
accompanied by a deviation of .0007 units in cotton prices
(y) below the line of trend. The unit employed in the
production figures was 1000 bales, in the deflated price
figures, one cent. In the interpretation of the equation
it may be simpler to use an x-unit of one million bales,
making the equation of regression

y = -.7378x


Thus a cotton crop one million bales above normal was 
accompanied by prices about three quarters of a cent per 
pound below normal (with reference always to deflated 
prices). This was the average relationship during the 
period 1900-1923. It did not hold in all cases, as is shown 
by the fact that r has a value of but -.567. If this, or
a similar law, held perfectly, r would have a value of -1.
The value of $S_y$, which measures the scatter about the
line of regression, may be computed from the formula

$$S_y = \sigma_y \sqrt{1 - r^2}$$

In the present case, $S_y$ has a value of 1.66 cents. The
significance of this measure has been explained in an
earlier section.
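
The figures secured above can be reproduced from the totals of Table 104. The following is a minimal sketch in Python; the variable names are assumptions of this illustration, and the standard deviations are carried unrounded, so that the last decimal places may differ slightly from those in the text.

```python
import math

# Totals from Table 104 (deviations from the parabolic lines of trend).
N = 23
SUM_X2 = 55_236_930.0   # sum of squared production deviations
SUM_Y2 = 93.5284        # sum of squared price deviations
SUM_XY = -40_752.29     # sum of the products of paired deviations

sigma_x = math.sqrt(SUM_X2 / N)
sigma_y = math.sqrt(SUM_Y2 / N)
r = SUM_XY / (N * sigma_x * sigma_y)

b_yx = r * sigma_y / sigma_x             # cents per thousand bales
s_y = sigma_y * math.sqrt(1 - r * r)     # scatter about the line of regression

print(round(r, 3))            # -0.567
print(round(b_yx * 1000, 2))  # -0.74 cents per million bales
print(round(s_y, 2))          # 1.66
```

Since the deviations are measured from the lines of trend, the sums of squares and products are divided by N directly, without first subtracting the means of the deviations.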

(It should be emphasized that the use of the above 
equation for estimating future prices is dependent upon 
the validity of projecting the two lines of secular trend.) 

In the preceding analysis deviations were measured in 
absolute units, and the results could be interpreted only 
in terms of absolute units, bales of cotton and cents per 
pound. For certain purposes it might have been more 
convenient to correlate percentage deviations from the 
two lines of trend, in which case the standard deviations 
and the equation of regression would have been expressed 
in these terms. The procedure, in this respect, will generally 
depend upon the use to which the results are to be put. 

It is obvious that in the above problem there is an 
arbitrary element which was not present in the correlation 
problems previously studied. The deviations are measured 
from lines of trend, not from the arithmetic means, and 
these lines of trend are arbitrarily selected. The use of 
different lines of trend might give quite different results. 
In the above example the lines of trend were both third 
degree parabolas. We may, with perhaps equal reason, 
assume that the underlying trend is best described by a 
straight line in each case. When such lines are fitted, and 
the deviations from these linear trends correlated, a value 
of -.608 is secured for the coefficient of correlation,
with $S_y$ having a value of 1.72 cents. (These results may
appear inconsistent with those secured when deviations
from the parabolic lines of trend are correlated, for both
$S_y$ and r are greater in the case in which linear trends are
assumed. The explanation is found in the fact that $\sigma_y$ is
also greater in the case of the linear trend; the value of r,
of course, depends upon the relation between $S_y$ and $\sigma_y$.)
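
The apparent inconsistency disappears when the standard deviation implied by each pair of results is worked out from the relation between the standard error, the standard deviation and r. The sketch below is in Python; the function name `sigma_from_scatter` is an assumption of this illustration, and the inputs are the rounded values quoted in the text.

```python
import math

def sigma_from_scatter(s_y, r):
    """Standard deviation implied by S_y = sigma_y * sqrt(1 - r^2)."""
    return s_y / math.sqrt(1 - r * r)

sigma_parabolic = sigma_from_scatter(1.66, -0.567)   # third degree parabolas
sigma_linear = sigma_from_scatter(1.72, -0.608)      # straight-line trends

print(round(sigma_parabolic, 2), round(sigma_linear, 2))
```

The standard deviation of the price deviations is about 2.02 cents about the parabola and about 2.17 cents about the straight line. The scatter about the line of regression is thus a smaller fraction of the total variation in the linear case, and r is accordingly larger.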


DIFFICULTIES IN THE CORRELATION OF TIME SERIES 


Which of these coefficients gives the true measure of 
the relationship between cyclical fluctuations in the two 
series? This question, unfortunately, cannot be answered 
definitively, and in this fact is found one of the chief diffi- 
culties faced in the correlation of time series. If we could 
say with certainty which of the two curves we have fitted 
to each series is the best measure of trend the question 
could be answered, but there is no final objective test to 
employ. It would appear from inspection that the curve 
of higher degree represents the trend rather better within 
the limits of the data in each case, but that is not capable 
of proof. Certainly it does not follow that the lines of 
trend which give the highest coefficient of correlation are 
the best. A high value for the coefficient may mean only 
that the two lines of trend are bad fits, both erring in the 
same way.¹

The presence of this arbitrary element in the correlation 
of deviations from lines of secular trend detracts somewhat 
from the confidence that may be placed in the results. 
With the selection of different lines of trend a number of 
quite different results may be secured for the coefficient 
of correlation between cyclical fluctuations in two series. 
The critical problem in such a case lies not in the mechanical 
process of correlation, but in the choice of an appropriate 
line of trend for each series. If, by the tests of inspection 
and of correspondence with such external evidence as may 
be available, it appears that the curve selected accurately 
represents the trend in each of the series correlated, the 
coefficient may be accepted as significant. But, in the
interpretation and use of the results, the presence of this
element of personal judgment in the preliminary calcula-
tions must not be forgotten. This applies with particular
force if the study aims to establish a functional relation-
ship between cyclical fluctuations in the two series, and if
an estimating (or regression) equation is to be based upon
the results.


¹ Cf. W. M. Persons, "Correlation of Time Series," Journal of the American
Statistical Association, June, 1923, 7265.


THE COEFFICIENT OF CORRELATION AND THE
MEASUREMENT OF TIME SEQUENCE


In the correlation of cotton production and cotton prices 
the object was to measure as accurately as possible the 
effect of variations in cotton production upon cotton 
prices. An equation was secured which described this relation 
when deviations were measured from the particular lines 
of trend employed. Cotton prices were considered to be a 
function of cotton production, and the object of the study 
was to measure this functional relationship. We seek, 
in such cases, to determine the degree to which cycles 
in one series depend upon or reflect cycles in a related 
series, assuming some functional relationship between 
them. This is essentially the problem described in intro- 
ducing the subject of correlation, and generally consti- 
tutes the major problem in studying the relation between 
series of any type. 

But a second and somewhat different problem may be 
faced in certain studies of time series. Assuming that 
two such series are marked by definite cycles, it is of 
interest to determine whether the cycles coincide in time, 
or whether cycles in one series consistently precede or 
lag behind cycles in the other. The coefficient of correla- 
tion has been found very useful in determining the degree 
of “lead” or “lag” in such cases. This problem is that
of determining merely temporal relationship, as opposed 
to the functional relationship which is ordinarily to be 
measured. 


BETWEEN TIME SERIES 421 


THE RELATION BETWEEN STOCK PRICE CYCLES AND CYCLES
OF BUSINESS ACTIVITY


To illustrate the solution of a problem of this latter 
type we may undertake to determine the relation, in 
time, between cyclical movements in industrial stock prices 
and in general business activity, as measured by the 
composite index compiled by the American Telephone 
and Telegraph Company. The monthly values of this 
index for the period 1877-1923 have been presented in 
an earlier section. The figures relating to stock prices are 
given in the following table: 

TABLE 105

Cycles in Industrial Stock Prices, 1903-1923¹
(Figures relate to deviations from normal in units of the standard deviation.)

             1903   1904   1905   1906   1907   1908   1909   1910   1911   1912   1913
January       -.2   -2.0    -.1   +2.3   +1.6   -1.3    +.5   +1.0    -.1    -.4    -.2
February      -.1   -2.1    +.2   +2.2   +1.4   -1.5    +.3    +.5    +.1    -.4    -.5
March         -.3   -2.1    +.6   +1.9    +.6   -1.1    +.3    +.8    -.1    -.1    -.7
April         -.5   -2.0    +.7   +1.7    +.6    -.8    +.5    +.5    -.1    +.3    -.6
May           -.5   -2.1    +.2   +1.4    +.4    -.5    +.8    +.4      0    +.2    ...
June          -.9   -2.1    +.3   +1.5    +.2    -.5    +.9    +.1    +.1    +.2   -1.1
July         -1.4   -1.8    +.7   +1.3    +.3    -.2   +1.1    -.4    +.1    +.2    -.9
August       -1.7   -1.6    +.8   +1.7    -.3    +.3   +1.4    -.3    -.3    +.8    -.7
September    -1.9   -1.3    +.7   +1.7    -.5    +.1   +1.4    -.3    -.7    +.4    -.5
October      -2.3    -.9    +.8   +1.7   -1.3    +.2   +1.4      0    -.7    +.3    -.8
November     -2.4    -.3   +1.1   +1.7   -1.9    +.5   +1.4    +.1    -.5    +.2    -.9
December     -2.1    -.1   +1.8   +1.7   -1.6    +.5   +1.3    -.2    -.4      0    -.9

             1914   1915   1916   1917   1918   1919   1920   1921   1922   1923
January       -.7  -2.42   +.42   +.61  -1.00   -.56  +1.16  -1.04   -.61   +.66
February      -.6  -2.47   +.33   +.15   -.68   -.53   +.40  -1.01   -.37   +.95
March         -.6  -2.30   +.31   +.41   -.80   -.13   +.78  -1.02   -.13  +1.14
April         -.8  -1.69   +.05   +.36   -.86   +.14   +.88   -.92   +.19   +.93
May           -.8  -1.72   +.08   +.32   -.61   +.77   +.17   -.88   +.34   +.53
June          -.7  -1.54   +.11   +.59   -.63  +1.22   +.17  -1.49   +.32   +.35
July         -1.1  -1.28   -.04   +.29   -.59  +1.56   +.11  -1.54   +.45   +.01
August          *   -.74   +.15   -.03   -.54  +1.01   -.27  -1.66   +.69   +.13
September       *   -.27   +.62   -.41   -.50  +1.38   -.16  -1.41   +.81   +.09
October         *   +.25   +.97   -.75   -.22  +1.86   -.31  -1.32   +.84   -.11
November        *   +.40  +1.40  -1.32   -.35  +1.62   -.80   -.97   +.51   +.14
December    -2.54   +.58   +.70  -1.41   -.50  +1.24  -1.28   -.69   +.64   +.37

* Stock Exchange closed.
1 These figures, the results of analyses made by W. M. Persons, are from the
Review of Economic Statistics, published by the Harvard Committee on Economic
Research. For the period January, 1903, to July, 1914, they are based upon the
average price of 12 industrial stocks; for the period since December, 1914, they
are based upon the average price of 20 industrial stocks (the Dow-Jones index).




The data of the two series are plotted in Fig. 76. From 
a comparison of the two curves in this chart it is clear 
that there is some relation between the movements in 
the two series; but such a comparison affords no basis 
for a definite conclusion. Our object is to determine 
whether the cycles in the two series are exactly synchro- 
nous and, if they are not, to measure the average time 
interval by which cycles in one series precede the cycles 
in another. The significance of such studies in the analysis 
of the business cycle is obvious. 

For the purpose of this investigation data for the period 
from January, 1903, to June, 1914, may be employed. 
The war years are omitted because the two series, during 
these years, were affected by the abnormal conditions 
prevailing. 

A coefficient of correlation is first computed for
concurrent items. A value of +.55 is secured. Next, the
data are correlated with industrial stock prices preceding
general business by one month. That is, the January,
1903, figure for stock prices is multiplied by the February,
1903, index of general business; the February stock price
is multiplied by the March business index, etc. This
process is carried through for the entire period from
January, 1903, to June, 1914. Only 137 monthly values are
used in this computation, as compared with 138 in the
preceding case, for the January, 1903, business index and
the June, 1914, stock price figure do not enter into the
calculations. Accordingly the values c_x and c_y (the two
corrections to be applied because the origin does not
coincide with the two averages) and the two standard
deviations will be slightly different. These corrections
may be readily made. The coefficient of correlation secured
from these computations has a value of +.65. The same
operation is repeated with other pairings of the two
variables. The results are summarized below.


FIG. 76. Comparison of the Cyclical Fluctuations in Industrial Stock Prices
and in General Business Activity, 1903-23

(Scales: index of general business; index of industrial stock prices.
* Stock Exchange closed.)


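TABLE 106

Coefficients of Correlation between Industrial Stock Prices and an Index
of General Business Activity
(Based upon data for the period 1903-1914.)

                                                        Coefficient of
                                                          Correlation
Stock prices concurrent with business index                  +.55
Stock prices preceding business index by 1 month             +.65
Stock prices preceding business index by 2 months            ...
Stock prices preceding business index by 3 months            ...
Stock prices preceding business index by 4 months            +.76
Stock prices preceding business index by 5 months            +.76
Stock prices preceding business index by 6 months            +.76
Stock prices preceding business index by 7 months            ...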

These figures are plotted in Fig. 77.
The coefficients increase to a maximum value of +.76
which is secured with stock prices preceding general business
by 4, 5 and 6 months. The stability of the coefficients
with the period of “lead” varying from 3 to 7 months
indicates that there is no one specific interval, within the
limits thus indicated, between the cyclical movements of
these two series. From the results here given it would
appear that five months is the average interval by which
stock prices precede the general business index, but this
is not sharply marked off as a constant relationship.


FIG. 77. Coefficients of Correlation between Index of Industrial Stock Prices
and Index of Business Activity, 1903-14, Showing the Results Secured with
Different Pairings

(In all pairings except that of concurrent items the business activity index
follows the stock price index. Horizontal axis: lag in months.)


FIG. 78. Coefficients of Correlation between Index of Industrial Stock Prices
and Index of Business Activity, 1919-23, Showing the Results Secured with
Different Pairings

(In all pairings except that of concurrent items the business activity index
follows the stock price index. Horizontal axis: lag in months.)

When the same figures are correlated for the post war
period, from January, 1919, to December, 1923, somewhat
different results are secured. These are presented below;
they are plotted in Fig. 78.


TABLE 107

Coefficients of Correlation between Industrial Stock Prices and an Index
of General Business Activity
(Based upon data for the period 1919-1923.)

                                                        Coefficient of
                                                          Correlation
Stock prices concurrent with business index                  +.75
Stock prices preceding business index by  1 month            +.83
Stock prices preceding business index by  2 months           +.87
Stock prices preceding business index by  3 months           +.88
Stock prices preceding business index by  4 months           +.85
Stock prices preceding business index by  5 months           +.82
Stock prices preceding business index by  6 months           +.77
Stock prices preceding business index by  7 months           +.72
Stock prices preceding business index by  8 months           +.66
Stock prices preceding business index by  9 months           +.57
Stock prices preceding business index by 10 months           +.46
Stock prices preceding business index by 11 months           +.33


The period covered in this latter study is so short that 
no final conclusions may be drawn. The results indicate 
that since the war the movements of general business have 
followed more closely behind stock price movements than 
during pre-war days. A maximum correlation is secured
when stock prices precede the business index by three 
months. 
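
The pairing procedure behind Tables 106 and 107 may be sketched in a few
lines of Python. This is an illustration only: the two series below are
invented (a made-up cycle in which one series moves three months ahead of
the other), and the function names are hypothetical. With 138 monthly items,
a lag of one month leaves 137 pairs, as in the text.

```python
import math

def pearson(xs, ys):
    """Product-moment coefficient of correlation for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def lagged_pairs(leader, follower, lag):
    """Pair leader[t] with follower[t + lag]; `lag` items drop out of each list."""
    if lag == 0:
        return list(leader), list(follower)
    return list(leader[:-lag]), list(follower[lag:])

def lead_lag_table(leader, follower, max_lag):
    """Coefficient of correlation for each lag from 0 to max_lag months."""
    table = {}
    for lag in range(max_lag + 1):
        xs, ys = lagged_pairs(leader, follower, lag)
        table[lag] = pearson(xs, ys)
    return table

# Hypothetical data: an invented cycle, NOT the series of Table 105.
base = [math.sin(2 * math.pi * t / 40) + 0.4 * math.sin(2 * math.pi * t / 7)
        for t in range(141)]
stock = base[3:]       # 138 months; moves three months ahead of `business`
business = base[:-3]   # 138 months

table = lead_lag_table(stock, business, max_lag=11)
best_lag = max(table, key=table.get)
pairs_at_lag_1 = len(lagged_pairs(stock, business, 1)[0])

for lag, r in table.items():
    print(f"stock prices preceding business by {lag:2d} months: r = {r:+.2f}")
print("maximum at lag", best_lag, "| pairs used at lag 1:", pairs_at_lag_1)
```

As in the text, the lag giving the maximum coefficient is read as the average
interval of lead; a flat top in the table would indicate that no one interval
is sharply marked off.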


THE USE OF THE MOVING AVERAGE IN CORRELATING
CYCLES IN TIME SERIES


The preceding discussion has dealt only with cycles as
measured from mathematically fitted lines of trend. But
trend may be measured, as we have seen, by lines based
upon moving averages, and the cyclical deviations from
such lines may be correlated in precisely the same way as
deviations from other lines of trend. The arithmetic
mean of the deviations from such moving averages will not
necessarily be zero, as in the case of deviations measured
from lines fitted by the method of least squares, and a
corresponding correction must be made in correlating such
figures.
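
A minimal sketch of this procedure follows, assuming a centered five-year
moving average as the line of trend. The two series are invented for
illustration and the function names are hypothetical. The point to notice is
that the mean deviation is not zero, so the sums about zero must be corrected
by c_x and c_y before r is computed.

```python
import math

def centered_moving_average(values, period):
    """Centered moving average for an odd period; the ends are left as None."""
    half = period // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / period
    return out

def deviations_from_trend(values, period):
    """Deviations of the items from the moving average, where it is defined."""
    trend = centered_moving_average(values, period)
    return [v - t for v, t in zip(values, trend) if t is not None]

def r_with_corrections(xs, ys):
    """r from sums about zero, corrected because the means are not zero."""
    n = len(xs)
    c_x, c_y = sum(xs) / n, sum(ys) / n
    sigma_x = math.sqrt(sum(x * x for x in xs) / n - c_x ** 2)
    sigma_y = math.sqrt(sum(y * y for y in ys) / n - c_y ** 2)
    mean_product = sum(x * y for x, y in zip(xs, ys)) / n - c_x * c_y
    return mean_product / (sigma_x * sigma_y)

# Hypothetical annual figures, invented for illustration only.
series_a = [10, 12, 15, 13, 16, 20, 18, 21, 26, 24, 27, 33, 30, 34, 41]
series_b = [50, 47, 44, 46, 42, 37, 40, 36, 30, 33, 29, 22, 26, 21, 13]

dev_a = deviations_from_trend(series_a, period=5)
dev_b = deviations_from_trend(series_b, period=5)
mean_dev_a = sum(dev_a) / len(dev_a)
r = r_with_corrections(dev_a, dev_b)

print(f"mean deviation of series A: {mean_dev_a:+.3f}")
print(f"r between the two sets of deviations: {r:+.3f}")
```

A different period for the moving average would give different deviations
and a different coefficient, which is the arbitrary element discussed in the
next paragraph.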

Moving averages are subject to the same criticism as 
are mathematical lines of trend. There can be no cer- 
tainty that deviations from lines of trend based upon 
moving averages represent the effects of cyclical causes 
solely. The result in a given case depends upon the period 
of the moving average employed, and there is no perfect 
criterion by which to determine the best measure of trend. 
Significant and useful coefficients may be computed when 
deviations are measured from moving averages, but the 
presence of an arbitrary element in the work must be 
recognized and the results applied with corresponding 
reservations. 


THE CORRELATION OF SHORT TERM FLUCTUATIONS


In describing the variable factors that constitute compo- 
nent elements of the values of a series in time, it was pointed 
out that the coefficient of correlation would not generally 
be employed in comparing either the secular trends or the 
seasonal fluctuations of two series. It may be used to ad- 
vantage in measuring either functional or temporal relations 
between cyclical fluctuations, provided that the effects of 
the other variables have been, so far as possible, eliminated. 
The coefficient of correlation and the measures which are 
employed in conjunction with it have a further use in 
dealing with time series. They may be used to measure 
the relation between short term changes in two series, 
changes from year to year, month to month, or even from 




week to week or day to day, if desired. This problem is 
distinct from that studied in the preceding section and in 
the interpretation of the results the two should not be 
confused. 


TABLE 108

Computation of Coefficient of Correlation between Cotton Production
and Cotton Prices, 1901-1923

(Based upon first differences.)

Column (2), x: difference between production in given year and production
in preceding year (in millions of bales).
Column (3), y: difference between price in given year and price in
preceding year (in cents per pound, deflated).

  (1)          (2)        (3)          (4)           (5)           (6)
Crop Year       x          y           x²            y²            xy

1901-02      - .613     -1.28        .375769      1.6384      +   .78464
1902-03      +1.121     + .54       1.256641       .2916      +   .60534
1903-04      - .780     +4.34        .608400     18.8356      -  3.38520
1904-05      +3.587     -5.17      12.866569     26.7289      - 18.54479
1905-06      -2.863     +2.62       8.196769      6.8644      -  7.50106
1906-07      +2.699     -1.25       7.284601      1.5625      -  3.37375
1907-08      -2.167     +1.14       4.695889      1.2996      -  2.47038
1908-09      +2.135     -1.50       4.558225      2.2500      -  3.20250
1909-10      -3.237     +3.79      10.478169     14.3641      - 12.26823
1910-11      +1.604     + .60       2.572816       .3600      +   .96240
1911-12      +4.084     -4.79      16.679056     22.9441      - 19.56236
1912-13      -1.990     +1.44       3.960100      2.0736      -  2.86560
1913-14      + .453     +1.62        .205209      2.6244      +   .73386
1914-15      +1.979     -5.20       3.916441     27.0400      - 10.29080
1915-16      -4.943     +1.73      24.433249      2.9929      -  8.55139
1916-17      + .258     +2.18        .066564      4.7524      +   .56244
1917-18      - .148     +3.03        .021904      9.1809      -   .44844
1918-19      + .739     - .34        .546121       .1156      -   .25126
1919-20      - .620     +2.27        .384400      5.1529      -  1.40740
1920-21      +2.019     -6.02       4.076361     36.2404      - 12.15438
1921-22      -5.486     +3.63      30.096196     13.1769      - 19.91418
1922-23      +1.808     +2.86       3.268864      8.1796      +  5.17088

Totals      -22.847    -25.55     140.548313    208.6688      -117.37216
            +22.486    +31.79


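TABLE 108 (Continued)

$$c_x = \frac{-.361}{22} = -.0164, \qquad c_x^2 = .00027$$

$$c_y = \frac{+6.24}{22} = +.284, \qquad c_y^2 = .080656$$

$$\sigma_x = \sqrt{\frac{140.548313}{22} - .00027} = \sqrt{6.38829} = 2.528$$

$$\sigma_y = \sqrt{\frac{208.6688}{22} - .08066} = \sqrt{9.40429} = 3.067$$

$$\frac{\Sigma(xy)}{N} - c_x c_y = \frac{-117.37216}{22} - (-.0164 \times .284) = -5.3304404$$

$$r = \frac{-5.3304404}{2.528 \times 3.067} = \frac{-5.3304404}{7.753376} = -.687$$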


There are several ways in which the problem of com- 
paring short term fluctuations may be attacked. The 
absolute differences between successive items in two series 
may be correlated, or these differences may be expressed 
as percentages or ratios. Table 108 illustrates the pro- 
cedure employed in measuring the correlation between the 
absolute fluctuations from year to year (first differences) of 
cotton production and cotton prices. The original values 
from which the items in columns (2) and (3) are derived 
are given in Table 103. 

The process of computing r is identical with that 
employed in preceding examples, when deviations were 
measured from an arbitrary origin. The arbitrary origin 
in this case is zero, but corrections must be made in the 
various values since the algebraic sum of the given figures 
is not zero in either case. A value of —.687 is secured for r. 

The equation of regression and the value of S_y as
computed from the usual formulas are

y = -.8335x
S_y = 2.23 cents
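
The computation of Table 108 can be checked with a short script. The
sketch below enters the first differences of columns (2) and (3), applies
the corrections c_x and c_y, and derives r, the slope of the regression of
price on production (taken here as r σ_y / σ_x), and S_y (taken as
σ_y √(1 − r²)). The variable names are illustrative; the results should
agree with the text to rounding, with the slope coming out negative, near
-.83.

```python
import math

# First differences from Table 108: (production, millions of bales;
# deflated price, cents per pound), crop years 1901-02 through 1922-23.
first_differences = [
    (-0.613, -1.28), (1.121, 0.54), (-0.780, 4.34), (3.587, -5.17),
    (-2.863, 2.62), (2.699, -1.25), (-2.167, 1.14), (2.135, -1.50),
    (-3.237, 3.79), (1.604, 0.60), (4.084, -4.79), (-1.990, 1.44),
    (0.453, 1.62), (1.979, -5.20), (-4.943, 1.73), (0.258, 2.18),
    (-0.148, 3.03), (0.739, -0.34), (-0.620, 2.27), (2.019, -6.02),
    (-5.486, 3.63), (1.808, 2.86),
]

n = len(first_differences)
xs = [x for x, _ in first_differences]
ys = [y for _, y in first_differences]

# Corrections: the origin (zero) does not coincide with the two averages.
c_x = sum(xs) / n
c_y = sum(ys) / n

sigma_x = math.sqrt(sum(x * x for x in xs) / n - c_x ** 2)
sigma_y = math.sqrt(sum(y * y for y in ys) / n - c_y ** 2)
mean_product = sum(x * y for x, y in zip(xs, ys)) / n - c_x * c_y

r = mean_product / (sigma_x * sigma_y)
slope = r * sigma_y / sigma_x            # regression of price on production
s_y = sigma_y * math.sqrt(1 - r ** 2)    # standard error of estimate

print(f"sigma_x = {sigma_x:.3f}, sigma_y = {sigma_y:.3f}")
print(f"r = {r:.3f}, slope = {slope:.4f}, S_y = {s_y:.2f} cents")
```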


A comparison of the different results secured in the
preceding examples relating to cotton throws some
interesting light upon the general problem of correlation. In
fact we have measured the correlation between three
different things which are not strictly comparable:
deviations from third degree parabolas, deviations from straight
lines of trend, and year to year fluctuations in the
production and price of cotton. Yet, if we were seeking to
estimate the price of cotton which would accompany a
given crop, an estimate might be based upon any one of
the three studies, the results of which are given below.


                                                     r         S_y

Correlation of cycles in cotton production
  and prices (deviations measured from third
  degree parabolas)                                 ...     1.66 cents
Correlation of cycles, same data (deviations
  measured from straight lines of trend)           -.608    1.72 cents
Correlation of year to year fluctuations, same
  data                                             -.687    2.23 cents


The value of r in the latest example is greater than the
values secured in the other cases, though the standard error 
is also larger. The reason for this apparent contradiction 
has been suggested above; the standard deviation of the 
year-to-year fluctuations is greater than the standard 
deviations about the linear and parabolic trends. 

At first glance it would appear that the error of estimate 
would be least if based upon the results secured when 
deviations from third degree parabolas were correlated, and 
greatest when based upon the study of year-to-year move- 
ments. But there is a concealed assumption in the first 
case, the assumption that the lines of trend of both prices
and production may be projected beyond the period 
studied. There is an immeasurable margin of error in this 
assumption, and the standard error of estimate, accord- 
ingly, does not give a true measure of the probabilities 
involved. The same is true with regard to the linear trends. 
No such assumption is involved in the measure based upon 
year-to-year fluctuations, though a single exceptional year 
might introduce greater difficulties here than in either of 
the other cases. 




REFERENCES 


Moore, H. L. Economic Cycles: Their Law and Cause.
Moore, H. L. Forecasting the Yield and the Price of Cotton.
Moore, H. L. Generating Economic Cycles.

Persons, W. M. Correlation of Time Series. Journal of the
American Statistical Association, June, 1923. (This article
is also published in Rietz, H. L., Handbook of Mathematical
Statistics, 150-165.)

Persons, W. M. Indices of Business Conditions. Review of
Economic Statistics, Prel. Vol. I. 1919.

Persons, W. M. The Variate Difference Correlation Method and
Curve Fitting. Quarterly Publications of the American
Statistical Association, June, 1917.

Snow, E. C. Trade Forecasting and Prices (with discussion).
Journal of the Royal Statistical Society, May, 1923 (332-398).

Stamp, J. C. The Effect of Trade Fluctuations Upon Profits
(with discussion). Journal of the Royal Statistical Society,
July, 1918 (563-608).

Yule, G. U. On the Time Correlation Problem, with Especial
Reference to the Variate Difference Correlation Method (with
discussion). Journal of the Royal Statistical Society,
July, 1921 (497-537).


CHAPTER XII 


THE MEASUREMENT OF RELATIONSHIP: 
NON-LINEAR CORRELATION 


In the preceding chapters the discussion has been con- 
fined to cases in which the relationship between two 
variables may be described by a straight line. The
coefficient of correlation, r, is a measure of the degree to which
two variables approach a linear relationship and it is
significant only when a straight line gives a good fit to the
points representing the paired values of X and Y.

In fitting curves to time series, as explained in an earlier 
section, it is found that in many cases the trend is non- 
linear, and that a curve of higher degree is needed. The 
same thing is true in the field of our present discussion. 
It is possible to have a high degree of correlation between 
two variables when a straight line does not describe the 
relationship. In such a case there would be considerable 
scatter about the straight line of best fit, and the value 
of r would be misleadingly low. If a curve representing
the real relationship could be fitted, the scatter would
be materially reduced and the true correlation could be
measured. The figures presented in Table 109 illustrate
such a case.

These data are plotted in Fig. 79. 

Two different curves have been fitted to the points 
plotted in this figure. One is a straight line having the 
equation 

Y = 5.038 + .0886X 


in which Y represents yield, in tons per acre, and X
represents depth of irrigation water applied, in inches. The
degree of relationship between the two variables, as
described by this line, is indicated by the coefficient of
correlation, r, which has a value of +.68.


TABLE 109
Alfalfa Yield and Irrigation
(Summary of Investigations at Davis, California, 1910-1915)¹

Depth of water                 Yield in tons per acre
applied (inches)    1910    1911    1912    1913    1914    1915   Averages

       0            3.85    5.94    5.52    2.75    2.89    2.35     3.88
      12            4.78    7.52    6.51    4.31    5.83    4.84     5.63
      18              -       -     7.02    5.69    8.02    6.46     6.80
      24            6.00    8.38    8.32    6.89    9.96    7.96     7.92
      30            7.53    9.54    9.43    7.97   11.06    8.32     8.98
      36            7.58    9.33    9.38    8.22   12.48    8.63     9.27
      48            8.45    9.52    8.63    8.83   10.62    8.05     9.02
      60              -       -    10.17    7.25   10.70    5.55     8.42


FIG. 79. Scatter Diagram Showing the Relation between Alfalfa Yield and
Irrigation Water Applied, with Two Lines of Regression

(Horizontal axis: inches of water applied by irrigation.)


1 This table is taken from “The Economical Irrigation of Alfalfa in the
Sacramento Valley” by S. H. Beckett and R. D. Robertson, Bull. No. 280,
Agricultural Experiment Station, Univ. of California, May, 1917.

An inspection of the figure shows clearly that the 
straight line does not give the best possible fit. It is certain, 
therefore, that r does not furnish a valid measure of the 
degree of relationship between alfalfa yield and depth of 
irrigation water. 
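
The straight line and the value of r quoted above can be reproduced from
the 44 observations of Table 109. The sketch below is an illustration with
hypothetical variable names; the 1913 yield at depth 0 is entered as 2.75,
the reading that agrees with the printed row average of 3.88. The constants
it prints should agree with the equation in the text to within rounding.

```python
import math

# Table 109: depth of water applied (inches) -> yields in tons per acre.
# Assumption: the 1913 yield at depth 0 is 2.75 (consistent with the
# printed average 3.88).
yields_by_depth = {
    0:  [3.85, 5.94, 5.52, 2.75, 2.89, 2.35],
    12: [4.78, 7.52, 6.51, 4.31, 5.83, 4.84],
    18: [7.02, 5.69, 8.02, 6.46],
    24: [6.00, 8.38, 8.32, 6.89, 9.96, 7.96],
    30: [7.53, 9.54, 9.43, 7.97, 11.06, 8.32],
    36: [7.58, 9.33, 9.38, 8.22, 12.48, 8.63],
    48: [8.45, 9.52, 8.63, 8.83, 10.62, 8.05],
    60: [10.17, 7.25, 10.70, 5.55],
}

xs = [depth for depth, column in yields_by_depth.items() for _ in column]
ys = [y for column in yields_by_depth.values() for y in column]
n = len(ys)

mean_x, mean_y = sum(xs) / n, sum(ys) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)

slope = s_xy / s_xx                      # least-squares straight line
intercept = mean_y - slope * mean_x
r = s_xy / math.sqrt(s_xx * s_yy)        # coefficient of correlation

print(f"N = {n}")
print(f"Y = {intercept:.3f} + {slope:.4f}X,  r = {r:+.2f}")
```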


PARABOLIC RELATIONSHIP 


The other curve is a second degree parabola, fitted by 
the method of least squares. The equation to this curve is 


Y = 3.55 + .252X - .002816X²


It is obvious that the effect of increasing irrigation upon 
alfalfa yield is described much more accurately by this 
latter curve than by the straight line. The most important 
result of these investigations was the determination of 
the point at which alfalfa yield began to fall off with in- 
creased applications of water, and the straight line fails 
to indicate any such decline. 

As the equation of relationship, therefore, we should use 
the parabolic rather than the linear form. The standard 
error, S_y, which is a necessary accompanying measure,
may be calculated by measuring the deviation of each 
value from the corresponding computed value, and de- 
termining the root-mean-square of these deviations. This 
procedure is illustrated in the following table. The figures 
for normal yield which are given in this table are computed 
from the parabolic equation given above. 


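TABLE 110

Comparison of Actual and Computed Alfalfa Yield

   (1)           (2)            (3)               (4)            (5)
 Depth of       Actual      Normal yield,     Deviation of
irrigation      yield       as computed       actual from
  water                    from parabolic       normal
                              equation          (2)-(3)
                                                   d              d²

     0           3.85          3.55             + .30            .0900
     0           5.94          3.55             +2.39           5.7121
     0           5.52          3.55             +1.97           3.8809
     0           2.75          3.55             - .80            .6400
     0           2.89          3.55             - .66            .4356
     0           2.35          3.55             -1.20           1.4400
    12           4.78          6.16             -1.38           1.9044
    12           7.52          6.16             +1.36           1.8496
    12           6.51          6.16             + .35            .1225
    12           4.31          6.16             -1.85           3.4225
    12           5.83          6.16             - .33            .1089
    12           4.84          6.16             -1.32           1.7424
    18           7.02          7.17             - .15            .0225
    18           5.69          7.17             -1.48           2.1904
    18           8.02          7.17             + .85            .7225
    18           6.46          7.17             - .71            .5041
    24           6.00          7.97             -1.97           3.8809
    24           8.38          7.97             + .41            .1681
    24           8.32          7.97             + .35            .1225
    24           6.89          7.97             -1.08           1.1664
    24           9.96          7.97             +1.99           3.9601
    24           7.96          7.97             - .01            .0001
    30           7.53          8.57             -1.04           1.0816
    30           9.54          8.57             + .97            .9409
    30           9.43          8.57             + .86            .7396
    30           7.97          8.57             - .60            .3600
    30          11.06          8.57             +2.49           6.2001
    30           8.32          8.57             - .25            .0625
    36           7.58          8.97             -1.39           1.9321
    36           9.33          8.97             + .36            .1296
    36           9.38          8.97             + .41            .1681
    36           8.22          8.97             - .75            .5625
    36          12.48          8.97             +3.51          12.3201
    36           8.63          8.97             - .34            .1156
    48           8.45          9.15             - .70            .4900
    48           9.52          9.15             + .37            .1369
    48           8.63          9.15             - .52            .2704
    48           8.83          9.15             - .32            .1024
    48          10.62          9.15             +1.47           2.1609
    48           8.05          9.15             -1.10           1.2100
    60          10.17          8.53             +1.64           2.6896
    60           7.25          8.53             -1.28           1.6384
    60          10.70          8.53             +2.17           4.7089
    60           5.55          8.53             -2.98           8.8804

                                                +24.22         80.9871
                                                -24.21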

Inserting the sum of the squared deviations in the
formula

$$S_y = \sqrt{\frac{\Sigma d^2}{N}}$$

we have

$$S_y = \sqrt{\frac{80.9871}{44}} = 1.36$$
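
The root-mean-square computation of Table 110 may be checked as follows.
This is a sketch under two assumptions: the 1913 yield at depth 0 is taken
as 2.75, and the normal yields are entered as the two-decimal values of
column (3) (where that column is hard to read they are inferred from the
deviations, and the total of the squares agrees with the printed 80.9871).

```python
import math

# Table 109 observations: depth (inches) -> yields in tons per acre.
yields_by_depth = {
    0:  [3.85, 5.94, 5.52, 2.75, 2.89, 2.35],
    12: [4.78, 7.52, 6.51, 4.31, 5.83, 4.84],
    18: [7.02, 5.69, 8.02, 6.46],
    24: [6.00, 8.38, 8.32, 6.89, 9.96, 7.96],
    30: [7.53, 9.54, 9.43, 7.97, 11.06, 8.32],
    36: [7.58, 9.33, 9.38, 8.22, 12.48, 8.63],
    48: [8.45, 9.52, 8.63, 8.83, 10.62, 8.05],
    60: [10.17, 7.25, 10.70, 5.55],
}

# Normal yields as tabulated in column (3) of Table 110 (assumed readings).
normal_yield = {0: 3.55, 12: 6.16, 18: 7.17, 24: 7.97,
                30: 8.57, 36: 8.97, 48: 9.15, 60: 8.53}

# d = actual yield minus normal yield, for every observation.
deviations = [y - normal_yield[depth]
              for depth, column in yields_by_depth.items() for y in column]

n = len(deviations)
sum_d2 = sum(d * d for d in deviations)
s_y = math.sqrt(sum_d2 / n)              # root-mean-square deviation

print(f"N = {n}, sum of d^2 = {sum_d2:.4f}, S_y = {s_y:.2f}")
```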


THE INDEX OF CORRELATION


We need now the third value, the abstract measure of
degree of relationship. In dealing with cases of linear
relationship in the preceding chapter we found that such
a measure, the coefficient of correlation, could be derived
from known values of S_y and σ_y. An analogous measure
may be derived in the same way in cases of non-linear
relationship, such as that found in the present problem.
Since the term coefficient of correlation and the symbol r
refer only to cases of linear regression, we may term this
general measure the index of correlation, and use the symbol
ρ (rho) to represent it.¹

1 This symbol has been employed by Spearman to represent the coefficient of
correlation when based upon the squares of differences in rank. This usage has no
apparent significance in ordinary economic analysis, however, and confusion is not
likely to arise from the employment of the symbol for the present purpose.


As a general formula for the index of correlation we
have¹

$$\rho_{yx}^2 = 1 - \frac{S_y^2}{\sigma_y^2}$$

The value of S_y has been derived above. The value of
σ_y, computed by familiar methods, is found to be 2.27.
Substituting in the formula for ρ we have

$$\rho_{yx} = \sqrt{1 - \frac{1.8406}{(2.27)^2}} = .80$$


This value is materially greater than that of the coeffi- 
cient of correlation for the same data. The value of r 
is +.68. The difference is due to the fact that the second 
degree parabola constitutes a much better fit to the data 
than the straight line. The correlation is distinctly non- 
linear, and r is an inappropriate measure of correlation. 
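
The comparison of r and ρ can be sketched by fitting both curves to the
data of Table 109 by least squares and computing the index of correlation
from the scatter about each. The helper names below are hypothetical, and
the 1913 yield at depth 0 is assumed to be 2.75. A fit computed this way
gives parabola constants close to, though not necessarily identical with,
those printed, since the printed constants are rounded.

```python
import math

yields_by_depth = {
    0:  [3.85, 5.94, 5.52, 2.75, 2.89, 2.35],
    12: [4.78, 7.52, 6.51, 4.31, 5.83, 4.84],
    18: [7.02, 5.69, 8.02, 6.46],
    24: [6.00, 8.38, 8.32, 6.89, 9.96, 7.96],
    30: [7.53, 9.54, 9.43, 7.97, 11.06, 8.32],
    36: [7.58, 9.33, 9.38, 8.22, 12.48, 8.63],
    48: [8.45, 9.52, 8.63, 8.83, 10.62, 8.05],
    60: [10.17, 7.25, 10.70, 5.55],
}
xs = [depth for depth, column in yields_by_depth.items() for _ in column]
ys = [y for column in yields_by_depth.values() for y in column]
n = len(ys)

def fit_polynomial(xs, ys, degree):
    """Least-squares constants [a, b, c, ...] from the normal equations."""
    m = degree + 1
    # Augmented matrix: sum(X^(i+j)) on the left, sum(X^i * Y) on the right.
    rows = [[sum(x ** (i + j) for x in xs) for j in range(m)]
            + [sum((x ** i) * y for x, y in zip(xs, ys))] for i in range(m)]
    for col in range(m):                 # Gauss-Jordan, partial pivoting
        pivot = max(range(col, m), key=lambda k: abs(rows[k][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for k in range(m):
            if k != col:
                factor = rows[k][col] / rows[col][col]
                rows[k] = [v - factor * p for v, p in zip(rows[k], rows[col])]
    return [rows[i][m] / rows[i][i] for i in range(m)]

def evaluate(constants, x):
    """Value of a + bX + cX^2 + ... at x."""
    return sum(k * x ** power for power, k in enumerate(constants))

mean_y = sum(ys) / n
variance_y = sum((y - mean_y) ** 2 for y in ys) / n      # sigma_y squared

def index_of_correlation(degree):
    """Constants of the fitted curve and rho = sqrt(1 - S_y^2 / sigma_y^2)."""
    constants = fit_polynomial(xs, ys, degree)
    s2 = sum((y - evaluate(constants, x)) ** 2 for x, y in zip(xs, ys)) / n
    return constants, math.sqrt(1 - s2 / variance_y)

line, rho_line = index_of_correlation(1)          # equals |r| for a line
parabola, rho_parabola = index_of_correlation(2)

print("straight line:", [round(k, 4) for k in line], "rho =", round(rho_line, 2))
print("parabola:     ", [round(k, 6) for k in parabola], "rho =", round(rho_parabola, 2))
```

For the straight line the index reduces to the absolute value of r, which
is why the two measures can be compared directly.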


THE MEANING OF THE INDEX OF CORRELATION


It is important that the significance and the limitations
of ρ be understood. Its value depends upon the relation
between the scatter about the fitted line and the scatter
about the arithmetic mean of the Y’s. In the case of a
straight line, ρ and r are identical, r being a special case
of ρ. The limits of ρ are 0 and 1, a value of 0 indicating
that if there is a relationship between the two variables it
cannot be described by the particular equation employed.
A value of 1 indicates that the relationship, as described
by the equation employed, is a perfect one. For curves
of higher degree no positive or negative sign should be
attached to ρ, for the relationship might be positive over
part of the range and negative over other parts, as in the
alfalfa example given above.

1 With X dependent this formula becomes

$$\rho_{xy}^2 = 1 - \frac{S_x^2}{\sigma_x^2}$$

The first of the two subscripts refers always to the dependent variable, the second
to the independent. It is essential that these be shown, for ρ would not necessarily
be the same with X dependent as with Y dependent. Such a distinction is not
necessary in the case of linear correlation, for r is the same no matter which
variable be dependent.

The index of correlation, ρ, has no significance unless the
type of curve to which it applies be named in each case.
The meaning of r in this respect is always clear, for it is
understood that it relates always to a straight line, but
confusion would arise in the case of ρ unless the type of
curve were specifically mentioned. The index of correlation
may be looked upon as a measure of the adequacy of
a curve of a given type to describe the relationship between
two variables.

It is, of course, always possible to secure a curve which
will pass through any number of points if the constants in
the equation be equal to the number of points. In such
a case ρ would, of necessity, be equal to 1, but this value
would have no significance. In any employment of
mathematical functions there is this limit of absurdity, when the
number of constants is equal to the number of points, and
ρ would merely reflect this absurdity. The ordinary
principles of curve fitting must be kept in mind in using such an
index as this. It must never be taken to have an absolute
significance, standing by itself. Its significance is always
relative, referring to the particular function employed.
This fact, which is true of every measure of correlation,
is frequently overlooked, and invalid and fallacious
conclusions reached as a result.
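
The limit of absurdity is easily demonstrated. In the sketch below four
invented points are fitted with curves of the first, second and third
degree; the third-degree curve has four constants, passes through every
point, and yields an index of 1 whatever the data. The observations and
helper names are hypothetical.

```python
import math

def fit_polynomial(xs, ys, degree):
    """Least-squares constants [a, b, c, ...] from the normal equations."""
    m = degree + 1
    rows = [[sum(x ** (i + j) for x in xs) for j in range(m)]
            + [sum((x ** i) * y for x, y in zip(xs, ys))] for i in range(m)]
    for col in range(m):                 # Gauss-Jordan, partial pivoting
        pivot = max(range(col, m), key=lambda k: abs(rows[k][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for k in range(m):
            if k != col:
                factor = rows[k][col] / rows[col][col]
                rows[k] = [v - factor * p for v, p in zip(rows[k], rows[col])]
    return [rows[i][m] / rows[i][i] for i in range(m)]

def evaluate(constants, x):
    """Value of a + bX + cX^2 + ... at x."""
    return sum(k * x ** power for power, k in enumerate(constants))

def index_of_correlation(xs, ys, degree):
    """rho = sqrt(1 - S_y^2 / sigma_y^2) for a fitted curve of given degree."""
    n = len(ys)
    mean_y = sum(ys) / n
    variance_y = sum((y - mean_y) ** 2 for y in ys) / n
    constants = fit_polynomial(xs, ys, degree)
    s2 = sum((y - evaluate(constants, x)) ** 2 for x, y in zip(xs, ys)) / n
    return math.sqrt(max(0.0, 1 - s2 / variance_y))

# Hypothetical observations, invented for illustration: four points only.
xs = [0, 1, 2, 3]
ys = [3.9, 7.9, 9.0, 8.4]

index_by_degree = {degree: index_of_correlation(xs, ys, degree)
                   for degree in (1, 2, 3)}
for degree, rho in index_by_degree.items():
    print(f"degree {degree} ({degree + 1} constants): rho = {rho:.4f}")
```

The index never falls as constants are added, so a high value secured with
many constants and few points says little about the relationship itself.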


A SHORT METHOD OF COMPUTING THE INDEX OF
CORRELATION


The standard error and the index of correlation were
computed by a rather laborious method in the above
example, in order that there might be no misunderstanding
of their precise meaning. The burden of calculation may
be materially reduced, however, by taking advantage of
the relationships which were disclosed in dealing with r.
For a curve of the potential series

$$Y = a + bX + cX^2 + dX^3 + \ldots$$

the formula for S_y is derived by a simple extension of that
employed in the case of the straight line. As a general
formula for a series of this type, we have

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y) - d\Sigma(X^3Y) - \ldots}{N}$$

Similarly, the formula for r may be extended to give a
general formula for ρ applicable to any equation of this
general type. This formula¹ is

$$\rho_{yx}^2 = \frac{a\Sigma(Y) + b\Sigma(XY) + c\Sigma(X^2Y) + d\Sigma(X^3Y) + \ldots - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

In the special case in which the origin is at the mean of
the Y’s, Σ(y) = 0 and c_y = 0, and the formula reduces to

$$\rho_{yx}^2 = \frac{b\Sigma(Xy) + c\Sigma(X^2y) + d\Sigma(X^3y) + \ldots}{\Sigma(y^2)}$$

The characteristics of the formulas for S and ρ should
be noted. The only values required in securing these
measures are the constants in the equation which describes
the average relationship, certain values which have been
used in the process of fitting and, in addition, Σ(Y²) and
c_y. Thus, as direct by-products of the fitting process,
we have the values of S and ρ, the two measures which are
needed to supplement the regression equation in securing
a complete description of the relationship between the two
variables in question. The equation describes the average
relationship. The standard error, S, is a measure of the
reliability of estimates based upon this equation, and ρ is
an abstract index of the degree of relationship, in so far as
that relationship can be described by the particular curve
employed.

1 See Appendix A for a discussion of the derivation of this formula.

The application of these formulas may be illustrated 
with reference to the problem of alfalfa yield. The fol- 
lowing values, derived from the data of Table 109 and from 
the fitting process, are required for this purpose: 


a = 3.5468               Σ(X²Y) = 407,448.00
b = .2520                c_y² = 55.9504
c = -.0028162            Σ(Y²) = 2688.3129
Σ(Y) = 329.03            N = 44
Σ(XY) = 10,269.96


Substituting in the formula for the standard error for a
second degree parabola,

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y)}{N}$$

we have

$$S_y^2 = \frac{2688.3129 - (3.5468 \times 329.03) - (.2520 \times 10{,}269.96) - (-.0028162 \times 407{,}448)}{44} = \frac{80.7345}{44} = 1.8349$$

$$S_y = 1.36$$

The index of correlation, for a curve of this type, is
computed from the equation

$$\rho_{yx}^2 = \frac{a\Sigma(Y) + b\Sigma(XY) + c\Sigma(X^2Y) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

Substituting the appropriate values, we have

$$\rho_{yx}^2 = \frac{145.7608}{2688.3129 - (44 \times 55.9504)}$$

and

$$\rho_{yx} = .80$$
These methods of deriving S and ρ are applicable over
a wide field by a simple adaptation of the formulas to the
particular equations that may be employed in given
instances. Further illustrations are given in the succeeding
chapter, while this general method is explained in more
detail in Appendix A.
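
The short method amounts to two small functions of the fitted constants
and the sums already formed in fitting. The sketch below applies them to
the values listed in the text for the alfalfa parabola; the function and
variable names are hypothetical. The formulas are exact only when the
constants satisfy the normal equations, so results from rounded constants
may differ slightly from those of the direct method.

```python
import math

def standard_error_squared(constants, sums, sum_y2, n):
    """S_y^2 = [sum(Y^2) - a*sum(Y) - b*sum(XY) - c*sum(X^2 Y) - ...] / N."""
    explained = sum(k * s for k, s in zip(constants, sums))
    return (sum_y2 - explained) / n

def index_squared(constants, sums, sum_y2, n, c_y2):
    """rho^2 = [a*sum(Y) + b*sum(XY) + ... - N*c_y^2] / [sum(Y^2) - N*c_y^2]."""
    explained = sum(k * s for k, s in zip(constants, sums))
    return (explained - n * c_y2) / (sum_y2 - n * c_y2)

# Values listed in the text for the alfalfa problem.
constants = [3.5468, 0.2520, -0.0028162]     # a, b, c
sums = [329.03, 10269.96, 407448.00]         # sum(Y), sum(XY), sum(X^2 Y)
sum_y2, n, c_y2 = 2688.3129, 44, 55.9504

s2 = standard_error_squared(constants, sums, sum_y2, n)
rho2 = index_squared(constants, sums, sum_y2, n, c_y2)
numerator = rho2 * (sum_y2 - n * c_y2)
rho = math.sqrt(rho2)

print(f"S_y^2 = {s2:.4f}")
print(f"rho^2 = {numerator:.4f} / {sum_y2 - n * c_y2:.4f}, rho = {rho:.2f}")
```

For a curve of higher degree only the two lists need be lengthened, one
constant and one sum for each added term.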


THE CORRELATION RATIO


A third distinctive measure of correlation remains to
be described. This is the correlation ratio, devised by
Karl Pearson and represented by the symbol η (eta).
This measure may be looked upon as a special case of ρ,
but somewhat different methods are employed in its
computation.

We have seen that in all cases the degree of relationship
between two variables, as described by a curve of a given
type, may be determined from the formula

$$\text{Measure of correlation} = \sqrt{1 - \frac{S_y^2}{\sigma_y^2}}$$

The coefficient of correlation, r, is just such a measure,
when S_y represents the standard deviation about a straight
line. The index of correlation, ρ, is a general measure of
the same type. The correlation ratio is precisely the same
sort of measure, S_y in this case representing the standard
deviation about a line passing through the mean of every
column in the correlation table. We have, in effect,
increased the number of constants in the equation of the
curve to be fitted until the number is equal to the number
of columns. If the means of all the columns lie on a straight
line, the correlation ratio and the coefficient of correlation
will be equal. If the means of the columns do not lie on a
straight line, the correlation ratio will be greater than the
coefficient of correlation.

No new principle is involved, therefore, in the concept of 
the correlation ratio. It is employed when the regression 
is non-linear. It measures the degree of relationship 
between two variables, in so far as this relationship may be 




described by a curve passing through the mean of every 
column. If the relationship is perfect, if there is no scatter 
about the curve fitted in this way, $\eta$ will have a value of 1.
If there is no relationship, if the scatter about the curve
is as great as the dispersion about the mean of the Y’s, $\eta$
will have a value of zero.

The formula generally employed in the computation of 
the correlation ratio differs somewhat from that given 
above. To represent the standard deviation about the 
line joining the means of the columns, the symbol $\sigma_{ay}$
is employed, instead of $S_y$. Its meaning is precisely the
same as that of $S_y$, as employed above, except that $\sigma_{ay}$
refers always to a correlation table.

The formula may be written

$$\eta_{yx} = \sqrt{1 - \frac{\sigma_{ay}^2}{\sigma_y^2}}$$

When eta is written as above ($\eta_{yx}$) it refers to the regression
of Y on X (Y dependent). When it is written $\eta_{xy}$, it
refers to the regression of X on Y (X dependent), and its
value depends upon the scatter about a line joining the
means of the rows. Unlike $r$, which has the same value
for both regressions, $\eta_{yx}$ and $\eta_{xy}$ will have different values
unless the regression be linear.


THE COMPUTATION OF THE CORRELATION RATIO 


The following correlation table shows the general rela- 
tion between the amount of nitrogen, in pounds per acre, 
used as fertilizer in certain agricultural experiments, and 
the corresponding yield of wheat, in bushels per acre.¹
The points are plotted in Fig. 80 on page 444. 

1 This table is based upon experiments described by E. Davenport (“Comparative
Agriculture” in Bailey’s Cyclopedia of American Agriculture). The actual
figures used have been arbitrarily chosen for the purpose of the present illustration,
but Davenport’s experiments have demonstrated the existence of a law similar
to the one here assumed.




TABLE 111 
Correlation Table showing the Relation between Wheat Yield per Acre 
and amount of Nitrogen used as Fertilizer 


X — Nitrogen applied in pounds per acre 


0–19.9 | 20–39.9 | 40–59.9 | 60–79.9 | 80–99.9 | 100–119.9 | 120–139.9 | 140–159.9 | 160–179.9

[Body of table: frequencies classified by Y, wheat yield in bushels per acre. The cell entries are not legible in this copy.]


For the computation of $\eta_{yx}$ by the formula given above
we need the values of $\sigma_y$ and $\sigma_{ay}$, the latter being the
root-mean-square deviation about the line joining the
means of the various columns. The former value may be




obtained readily by methods already familiar. It is possible
to compute the quantity $\sigma_{ay}$ by the method first employed
in calculating $S_y$, that is, by measuring and squaring the
deviations of the individual points from the line of regression.
In the present case, however, the line describing
the relationship passes through the mean of each column,
hence these means may be used in place of the “normal”
values as computed from an equation of regression. In
computing $\sigma_{ay}$, therefore, the deviations of the individual
items from the means of the various columns are squared,
added, the mean determined and the square root extracted,
just as in the computation of the standard deviation.
Part of the procedure may be illustrated by using the
data in the first column of Table 111. This column contains
all items having X-values between 0 and 20. The
mean Y-value of the 21 items falling in this column is
5.05; deviations are measured from this value.

FIG. 80. Scatter Diagram Showing the Relation between Wheat Yield and
Nitrogen applied as Fertilizer, with Straight Line of Regression and Line
joining the Means of the Columns. (Vertical axis: Wheat Yield in Bushels
per Acre. Horizontal axis: Nitrogen Applied in Pounds per Acre.)


TABLE 112
Computation of the Squares of the Deviations about the Mean of an Array

Class-interval      Mid-     Fre-      Deviation from
(wheat yield in     point    quency    mean of column
bu. per acre)                          (5.05)
                    m        f         d          d²         fd²
 8–11.9             10        3         4.95      24.5025    73.5075
 4– 7.9              6       10          .95        .9025     9.0250
 0– 3.9              2        8        −3.05       9.3025    74.4200
Total                                                       156.9525


The sum of the squared deviations is obtained for each 
of the other columns in a similar fashion. The standard 
deviation about the means of all the columns, $\sigma_{ay}$, is found
to have a value of 2.420. The value of $\sigma_y$ is 9.188.

Substituting the given values in the formula 


$$\eta_{yx}^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2}$$

we have

$$\eta_{yx}^2 = 1 - \frac{(2.42)^2}{(9.188)^2} = 1 - .0694 = .9306$$

$$\eta_{yx} = .965$$


This is the value of the correlation ratio, measuring the 
degree of scatter about a line running through the means 
of the columns. Its significance is discussed below. 
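
The direct method may be traced in a few lines of Python. The sketch below is an added illustration under stated assumptions: it uses the class mid-points and frequencies of Table 112 for the first column, and the values 2.420 and 9.188 quoted above. The function names are hypothetical.

```python
import math

def column_sum_of_squares(midpoints, frequencies):
    """Mean of one column and the sum of squared deviations about that mean."""
    n = sum(frequencies)
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    ss = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies))
    return mean, ss

def correlation_ratio_from_scatter(sigma_ay, sigma_y):
    """Correlation ratio from the scatter about the column means."""
    return math.sqrt(1 - sigma_ay ** 2 / sigma_y ** 2)

# First column of Table 111, as set out in Table 112.
col_mean, col_ss = column_sum_of_squares([10, 6, 2], [3, 10, 8])

# Values of sigma_ay and sigma_y quoted in the text.
eta_yx = correlation_ratio_from_scatter(2.420, 9.188)
print(f"column mean = {col_mean:.2f}, sum of squares = {col_ss:.4f}, eta = {eta_yx:.3f}")
```

The sketch carries the unrounded column mean, so its sum of squares differs from the tabled 156.9525 only in the last decimal places.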

The method of calculation employed in the preceding 
example may be materially shortened. Let $\sigma_{my}$ represent
the standard deviation of the means of the various columns
about the arithmetic mean of all the Y’s. In computing



this value the mean of each column is weighted by the 

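number of items in that column. It may be shown¹ that

$$\sigma_{ay}^2 = \sigma_y^2 - \sigma_{my}^2$$

Substituting for $\sigma_{ay}^2$ in the equation

$$\eta_{yx}^2 = 1 - \frac{\sigma_{ay}^2}{\sigma_y^2}$$

we secure

$$\eta_{yx}^2 = 1 - \frac{\sigma_y^2 - \sigma_{my}^2}{\sigma_y^2} = \frac{\sigma_{my}^2}{\sigma_y^2}$$

$$\eta_{yx} = \frac{\sigma_{my}}{\sigma_y}$$

Since $\sigma_{my}$ may be much more easily determined than $\sigma_{ay}$,
the value of $\eta$ is generally computed from this formula.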
The data of Table 111 may be used to exemplify the process. 


1 The following proof is adapted from Yule.

Given a series with mean $M$ made up of two component series with means $M_1$
and $M_2$. $N$, the total number of observations, is equal to $N_1 + N_2$, the sum of the
observations in the two component series. What is the relation between $\sigma$, $\sigma_1$ and
$\sigma_2$? If we let

$$M_1 - M = c_1, \qquad M_2 - M = c_2$$

then for $S_1^2$, the mean-square deviation of the observations in the first of the two
component series, measured from $M$ as origin, we have

$$S_1^2 = \sigma_1^2 + c_1^2$$

Similarly

$$S_2^2 = \sigma_2^2 + c_2^2$$

But $N_1 S_1^2$ is equal to the sum of the squares of the deviations, about $M$, of the items
in the first of the component series, and $N_2 S_2^2$ is equal to the sum of the squares
of the deviations, about $M$, of the items in the second of the two component series.
Therefore

$$\sigma^2 = \frac{N_1 S_1^2 + N_2 S_2^2}{N}$$

and

$$N\sigma^2 = N_1 S_1^2 + N_2 S_2^2 \tag{1}$$

But $S_1^2 = \sigma_1^2 + c_1^2$ and $S_2^2 = \sigma_2^2 + c_2^2$, therefore

$$N\sigma^2 = N_1(\sigma_1^2 + c_1^2) + N_2(\sigma_2^2 + c_2^2) \tag{2}$$

In the present case we have the major series with mean represented by $M_y$,
and a number of component series (the items arranged by columns) with means
represented by $m_y$, etc. Let $S_{ay}$ represent the standard deviation of any column


TABLE 113
Illustrating the Computation of the Correlation Ratio

Mid-point    Mean value    Deviation
of class,    of Y-items    from mean     Square of    Fre-
X            in column     of all Y's    deviation    quency
(pounds)     (bushels)     (25.005)
             m_y           d             d²           f        fd²
 10           5.05         −19.955       398.202      21     8,362.242
 30          15.12          −9.885        97.713      25     2,442.825
 50          24.40           −.605          .366      30        10.980
 70          28.73          +3.725        13.876      44       610.544
 90          31.73          +6.725        45.226      37     1,673.362
110          32.40          +7.395        54.686      20     1,093.720
130          32.00          +6.995        48.930       8       391.440
150          33.33          +8.325        69.306       6       415.836
170          34.00          +8.995        80.910       2       161.820
Total                                                 193    15,162.769

$$\sigma_{my} = \sqrt{\frac{15{,}162.769}{193}} = 8.864$$


of Y’s about the mean of that column. Then we have a number of component
series, with standard deviations $S_{ay}$, etc., and with means differing from the
mean of all the Y’s by $M_y - m_y$, etc. Substituting in equation (2) derived
above, we have

$$N\sigma_y^2 = n_1[S_{ay_1}^2 + (M_y - m_{y_1})^2] + n_2[S_{ay_2}^2 + (M_y - m_{y_2})^2] + \ldots \tag{3}$$

$$N\sigma_y^2 = \Sigma n[S_{ay}^2 + (M_y - m_y)^2] \tag{4}$$

But

$$N\sigma_{ay}^2 = \Sigma(n \cdot S_{ay}^2)$$

for, in each column,

$$S_{ay}^2 = \frac{\Sigma d^2}{n}$$

since $d$ represents a deviation from the mean of that column. Then, for all
columns,

$$\sigma_{ay}^2 = \frac{\Sigma d^2}{N} = \frac{\Sigma n \cdot S_{ay}^2}{N}$$

Substituting in equation (4)

$$N\sigma_y^2 = N\sigma_{ay}^2 + \Sigma n(M_y - m_y)^2 \tag{5}$$

By definition of the standard deviation of the means of the columns

$$\sigma_{my}^2 = \frac{\Sigma n(M_y - m_y)^2}{N}$$

Therefore, from (5),

$$\sigma_y^2 = \sigma_{ay}^2 + \sigma_{my}^2 \tag{6}$$




Substituting the given values in the formula

$$\eta_{yx} = \frac{\sigma_{my}}{\sigma_y}$$

we have

$$\eta_{yx} = \frac{8.864}{9.188} = .965$$


The process of computing the correlation ratio may be
briefly summarized:

1. Arrange the items in the form of a correlation table.

2. Find the arithmetic mean of all the Y-items in each column.
(i.e. Find the arithmetic mean of each Y-array of
type X.)

3. Compute the arithmetic mean of all the Y’s.

4. Measure the deviation of the mean of each column from
the mean of all the Y’s. Square each of these deviations
and multiply by the number of items in the given column.
Get the sum of the squared deviations.

5. Divide this sum by the total number of items and extract
the square root of the result. This gives the value of
$\sigma_{my}$.

6. Compute $\sigma_y$.

7. Divide $\sigma_{my}$ by $\sigma_y$. The quotient is $\eta_{yx}$.
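
The steps just summarized may be sketched in Python. This is an added illustration, not the author's own procedure: it assumes that the column means and frequencies of Table 113 are already in hand (steps 1 and 2), and it takes the value 9.188 for $\sigma_y$ from the text (step 6). The function name is hypothetical.

```python
import math

def correlation_ratio(column_means, column_counts, sigma_y):
    """Correlation ratio of Y on X from column means and column frequencies."""
    n_total = sum(column_counts)
    # Step 3: mean of all the Y's, here recovered as the weighted mean of the column means.
    grand_mean = sum(n * m for n, m in zip(column_counts, column_means)) / n_total
    # Step 4: weighted sum of squared deviations of column means from the grand mean.
    weighted_sq = sum(n * (m - grand_mean) ** 2
                      for n, m in zip(column_counts, column_means))
    # Step 5: standard deviation of the column means.
    sigma_my = math.sqrt(weighted_sq / n_total)
    # Step 7: the quotient is the correlation ratio.
    return sigma_my / sigma_y, sigma_my, grand_mean

# Column means (bushels) and frequencies from Table 113.
column_means = [5.05, 15.12, 24.40, 28.73, 31.73, 32.40, 32.00, 33.33, 34.00]
column_counts = [21, 25, 30, 44, 37, 20, 8, 6, 2]

eta_yx, sigma_my, grand_mean = correlation_ratio(column_means, column_counts, sigma_y=9.188)
print(f"mean of Y = {grand_mean:.3f}, sigma_my = {sigma_my:.3f}, eta_yx = {eta_yx:.3f}")
```

Interchanging the roles of rows and columns in the same function gives the correlation ratio of X on Y.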

The value of the correlation ratio of X on Y may be
similarly computed, substituting the proper values in the
formula

$$\eta_{xy} = \frac{\sigma_{mx}}{\sigma_x}$$

The symbol $\sigma_{mx}$ represents the standard deviation of the
means of the various rows about the mean of all the X’s.
The value of the correlation ratio of X on Y depends upon
the amount of scatter (horizontally) about the line joining
the means of the rows. Its value will generally be different
from that of the correlation ratio of Y on X. In the present
case the value of $\eta_{xy}$ is found to be .824. As the line of
relationship approaches the linear form the two correlation
ratios approach identity.




Like $r$, $\eta$ can never exceed 1, this value being secured
when there is no dispersion about the line joining the
means of the columns (or rows). From the formula

$$\eta_{yx} = \frac{\sigma_{my}}{\sigma_y}$$

it is evident that the value of the correlation ratio is zero
when $\sigma_{my}$ is zero. This is the case when the mean of each
column has the same value as the mean of all the Y’s.
Such a condition is found when an increase or decrease in 
the value of the X-variable brings no corresponding change 
in the value of the Y-variable. This means that in each 
column of the correlation table there is a distribution of 
cases similar to the general distribution of Y’s. When 
this is true there is clearly no relation between the two 
variables. 

The correlation ratio, it should be noted, never has a 
negative value. It is possible to determine by inspection 
of the correlation table, however, whether the relation 
between two variables is direct, or inverse, or a varying one. 

The coefficient of correlation has one distinct advantage, 
as compared with the correlation ratio, in that when its 
value and the values of the two standard deviations are 
known the equations to the lines of regression may be 
readily determined. This is not true of $\eta$. To get a
quantitative expression for the “law” of relationship between
two variables, when $\eta$ has been computed, an additional
calculation for the purpose of fitting a curve to the
means of the arrays would be necessary.


CORRECTION OF THE CORRELATION RATIO


The use of $\eta$ is only possible when the data are numerous,
and can be arranged in the form of a correlation table. If
a limited number of items should be so arranged, and it
chanced that there was but one item in each column, the
two measures $\sigma_{my}$ and $\sigma_y$ would be identical and $\eta$ would
necessarily have a value of 1. Computed from a very small
number of cases and employing a large number of classes,
the correlation ratio would be meaningless.

Karl Pearson has suggested a correction for the raw 
correlation ratio to offset, in particular, errors due to the 
employment of too fine a grouping (i.e. too many classes) 
for the data. If we represent by $\kappa$ the number of arrays,
the desired value is secured from the formula

$$\text{Corrected } \eta^2 = \frac{\eta^2 - \dfrac{\kappa - 3}{N}}{1 - \dfrac{\kappa - 3}{N}}$$

Applying this correction to the results secured in the
preceding example, we have

$$\text{Corrected } \eta^2 = \frac{(.965)^2 - \dfrac{9 - 3}{193}}{1 - \dfrac{9 - 3}{193}} = .929$$

$$\text{Corrected } \eta = .964$$


The correction is very slight in the present case, but if 
$N$ were small or $\kappa$ very large it would reduce the given
value materially. 
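
The correction is easily applied by machine. The following Python sketch is an added illustration which assumes the formula as written above, with 9 arrays and 193 items as in the wheat example; the function name is hypothetical.

```python
import math

def corrected_eta_squared(eta, n_arrays, n_items):
    """Corrected square of the correlation ratio for a given number of arrays."""
    adjustment = (n_arrays - 3) / n_items
    return (eta ** 2 - adjustment) / (1 - adjustment)

# Wheat example: raw eta of .965 from 9 columns and 193 items.
corrected_sq = corrected_eta_squared(0.965, n_arrays=9, n_items=193)
corrected_eta = math.sqrt(corrected_sq)
print(f"corrected eta^2 = {corrected_sq:.3f}, corrected eta = {corrected_eta:.3f}")
```

Calling the same function with a smaller number of items, or a larger number of arrays, shows how quickly the correction grows.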


RELATION BETWEEN THE CORRELATION RATIO AND THE 
COEFFICIENT OF CORRELATION 


When the relation between two variables is absolutely
linear the line running through the means of the columns
corresponds, of course, to the line upon which the coefficient
of correlation is based. When this is the case $\eta$ and
$r$ have the same value. As the relationship between the
two variables departs from the linear form the values
secured for $\eta$ and $r$ differ, $\eta$ being always greater than $r$.
This results from the fact that the scatter about a line
joining the means of the columns will always be less than
the scatter about a straight line fitted to these points,
except when the straight line passes through every mean
point. And the less the scatter about the line expressing
the average relationship the greater the value of the
measure of correlation. Thus for the alfalfa problem it
was found that $r$ has a value of + .68, and that an index
of correlation based upon a second degree parabola has a
value of .80. The correlation ratio for the same material
is .83. For the data of Table 111 the value of $\eta_{yx}$ was
found to be .964; the value of $r$ is + .793, the difference
between the two being marked. The reason for the difference
is found in Fig. 80, in which the straight line of regression
of Y on X and the line joining the means of the
columns are shown. The regression departs materially
from linearity, and the scatter about the straight line of
regression is much greater than the scatter about the line
joining the means.

The relation between $r$ and $\eta$ affords a convenient test
of linearity in a given instance, since the two values will be
identical when the regression is strictly linear, and will
differ the more widely the greater the departure from the
linear form. The general test for linearity is

$$\zeta = \eta^2 - r^2$$

Even in a case of linear regression it is probable that $\eta$
and $r$ will differ somewhat because of fluctuations due to
chance alone. A material difference, as reflected in the
magnitude $\zeta$ (Zeta), indicates that a straight line does not
describe the relationship in question and that $r$ is not a
suitable measure of correlation.

In the example given above, in which $\eta$ equals .964 and
$r$ equals .793, the measure $\zeta$ has a value of .300. This
is large enough to indicate conclusively that the regression
is non-linear. In a later section¹ the method of determining
the significance of a given value of $\zeta$ is explained.
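
The test for linearity reduces to a single subtraction. The Python sketch below is an added illustration which assumes the relation $\zeta = \eta^2 - r^2$ and uses the values quoted in the text for the wheat data and for the alfalfa problem; the function name is hypothetical.

```python
def linearity_zeta(eta, r):
    """Difference between the squared correlation ratio and the squared coefficient."""
    return eta ** 2 - r ** 2

# Wheat data of Table 111: eta = .964, r = .793.
zeta_wheat = linearity_zeta(0.964, 0.793)

# Alfalfa problem: eta = .83, r = .68.
zeta_alfalfa = linearity_zeta(0.83, 0.68)

print(f"zeta (wheat) = {zeta_wheat:.3f}, zeta (alfalfa) = {zeta_alfalfa:.3f}")
```

Whether a given value of the measure is significant is a separate question, which this sketch does not attempt to answer.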

1 Chapter XVI.




REFERENCES
Bowley, A. L. Elements of Statistics (365-367).
Kelley, Truman L. Statistical Method (238-245).
Pearl, Raymond. Medical Biometry and Statistics (311-318).
Pearson, Karl. Mathematical Contributions to the Theory of
    Evolution (XIV). On the General Theory of Skew Correlation
    and Non-Linear Regression. Draper’s Company Research
    Memoirs, Biometric Series II. 1905.
    Notes on the History of Correlation. Biometrika, Vol. 13,
    1920 (25-45).
    On a Correction Needful in the Case of the Correlation Ratio.
    Biometrika, Vol. 8, 1911 (254-256).
    On the Correction Necessary for the Correlation Ratio.
    Biometrika, Vol. 14, 1923 (412-417).
Rietz, H. L. (editor) Handbook of Mathematical Statistics (129-131).
Yule, G. U. An Introduction to the Theory of Statistics (204-207).


CHAPTER XIII 


THE MEASUREMENT OF RELATIONSHIP AND 
THE PROBLEM OF ESTIMATION 


It is no great exaggeration to say that quantitative 
method in economics and business centers about the prob- 
lem of estimation. Equations of regression, measures 
of standard error and coefficients of correlation are of 
interest largely because of their bearing upon the practical 
problems of determining probable production, probable 
price, probable business changes. It should not be under- 
stood from this that the problem of estimation relates 
only to attempts to forecast future changes. We make an 
estimate whenever we seek to determine the most probable 
value from a number of different observations, or whenever 
we employ an equation which describes the relation be- 
tween two or more variables. The value of statistical 
technique rests in large part upon its practical utility in the 
making of estimates. 

This object has been definitely to the fore in the pre- 
ceding chapters, which dealt with methods by which the 
value of one variable might be estimated from a given 
value of another. We may, at this point, briefly summarize 
certain assumptions upon which the validity of this method 
rests. 


SOME ASSUMPTIONS INVOLVED IN THE MAKING OF
ESTIMATES


In earlier chapters it has been pointed out that the most 
probable value of a series of observations is their arith- 
metic mean. Given a normal distribution about the mean, 
the standard deviation affords an exact measure of the 


probabilities involved in basing estimates upon the mean. 


Similarly, the standard error of estimate affords an exact 
measure of the probabilities involved in basing estimates 
upon an equation of regression, again upon the assumption 
that the distribution about the line of regression is normal. 
The significance and usefulness of the equation of regres- 
sion may be determined by comparing the standard error 
of estimate of a given variable with the standard deviation. 

From the relation between these two values, moreover, 
an abstract measure of relationship, the coefficient or 
index of correlation, may be computed. This coefficient, 
or index, is a thoroughly valid and accurate measure only 
if the distribution about the line of regression and the 
distribution about the mean are normal, or approximately 
so. Pronounced departures from the normal type lessen 
the significance of these measures. 

In the foregoing discussion we have been concerned with 
arithmetic values throughout. In speaking of estimates 
based upon the mean we referred to the arithmetic mean. 
The distributions about the mean and about the line of 
regression are assumed to be normal when deviations are 
measured arithmetically. The standard deviation and the 
standard error of estimate are in arithmetic terms, referring 
to absolute values. But may we assume that all the 
distributions we deal with in economic analysis are of the 
arithmetic type? Should estimates be made and errors 
of estimate measured only in arithmetic terms? If they 
should not be so limited, are the methods developed above 
capable of adaptation to other distributions? These ques- 
tions may best be answered in terms of a specific problem. 


A PROBLEM OF ESTIMATION 


In Table 114 the production and price of oats in the 
United States from 1881 to 1913 are recorded. Appro- 
priate lines of trend were fitted to these series and the 
ratios of the actual values of the items in each series to 
the trend values determined. 


THE PROBLEM OF ESTIMATION 455 
TABLE 114

Production and Price of Oats in the United States

        Production    Straight      Ratio of       Price of     Straight      Ratio of
        of oats in    line trend    actual pro-    oats at      line trend    actual
Year    U.S.          of pro-       duction to     Chicago      of price²     price to
        (millions     duction¹      trend value    (cents                     trend value
        of bu.)                                    per bu.)

1881      416           448           .929           47           36.0         1.30
1882      488           471          1.036           37           35.3         1.05
1883      571           494          1.156           31           34.6          .90
1884      583           517          1.128           29           34.0          .85
1885      629           540          1.165           28           33.2          .84
1886      624           563          1.108           25           32.5          .77
1887      659           586          1.124           30           31.2          .96
1888      701           609          1.151           24           30.5          .79
1889      751           632          1.188           24           29.8          .81
1890      523           655           .798           43           29.0         1.48
1891      738           678          1.088           31           28.3         1.10
1892      661           701           .943           30           27.5         1.09
1893      639           724           .882           31           26.8         1.16
1894      662           747           .886           28           26.1         1.07
1895      824           770          1.070           19           25.3          .75
1896      780           793           .983           18           23.6          .76
1897      791           816           .969           24           25.0          .96
1898      843           839          1.005           25           26.4          .95
1899      926           862          1.074           23           27.8          .83
1900      914           885          1.033           25           29.2          .86
1901      778           908           .857           42           30.6         1.37
1902     1053           931          1.131           33           32.0         1.03
1903      869           954           .911           38           33.4         1.14
1904     1009           977          1.033           30           34.8          .86
1905     1090          1000          1.090           31           36.2          .86
1906     1036          1023          1.013           39           37.6         1.04
1907      805          1046           .770           51           39.0         1.31
1908      851          1069           .796           52           40.4         1.29
1909     1068          1092           .978           43           41.8         1.03
1910     1186          1115          1.064           35           43.2          .81
1911      922          1138           .810           51           44.6         1.14
1912     1418          1161          1.221           37           46.0          .80
1913     1122          1184           .948           41           47.4          .87

1 This line of trend was fitted to data covering a longer period than that included
in the present study.
2 The entire period has been broken into two parts, 1881 to 1895 and 1896 to
1913. A straight line of trend was fitted by H. B. Killough to the data of each
period.




It is desired to measure the relation between these two
variables. A hyperbolic curve of the general type $Y = aX^b$
appears to be an appropriate form to employ in describing
such a relationship. To fit this curve by the method of
least squares, the equation must be reduced to the logarithmic
form

$$\log Y = \log a + b \log X$$

The normal equations required in fitting a curve of this
type are

$$\text{I} \qquad \Sigma(\log Y) = N \log a + b\,\Sigma(\log X)$$

$$\text{II} \qquad \Sigma(\log X \cdot \log Y) = \log a\,\Sigma(\log X) + b\,\Sigma(\log^2 X)$$

The values necessary for the solution of these equations
are determined from Table 115.
From this table we have

$N = 33$
$\Sigma(\log Y) = -.32849$       $\Sigma(\log X \cdot \log Y) = -.1143005$
$\Sigma(\log X) = .037535$       $\Sigma(\log^2 X) = .096423$

Substituting in the normal equations, we secure

$$-.32849 = 33 \log a + .037535\,b$$

$$-.1143005 = .037535 \log a + .096423\,b$$

Solving

$$\log a = -.00861$$

$$b = -1.18206$$

The required equation is

$$\log Y = (9.99139 - 10) - 1.18206 \log X$$

or

$$Y = .9804\,X^{-1.18206}$$

This is the equation which describes the average rela- 
tionship between the production and the price of oats 
(when the actual figures for each are expressed as ratios 
to the respective lines of trend). The corresponding curve 
is plotted in Fig. 82. 
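
The fitting process may be repeated by machine from the ratios themselves. The Python sketch below is an added illustration: it takes the production and price ratios of Table 114, forms the sums of common logarithms, and solves normal equations I and II for log a and b. The function name is hypothetical, and small differences from the tabled sums arise only from rounding.

```python
import math

# Ratios to trend, 1881 to 1913, from Table 114: production (X) and price (Y).
production = [0.929, 1.036, 1.156, 1.128, 1.165, 1.108, 1.124, 1.151, 1.188,
              0.798, 1.088, 0.943, 0.882, 0.886, 1.070, 0.983, 0.969, 1.005,
              1.074, 1.033, 0.857, 1.131, 0.911, 1.033, 1.090, 1.013, 0.770,
              0.796, 0.978, 1.064, 0.810, 1.221, 0.948]
price = [1.30, 1.05, 0.90, 0.85, 0.84, 0.77, 0.96, 0.79, 0.81,
         1.48, 1.10, 1.09, 1.16, 1.07, 0.75, 0.76, 0.96, 0.95,
         0.83, 0.86, 1.37, 1.03, 1.14, 0.86, 0.86, 1.04, 1.31,
         1.29, 1.03, 0.81, 1.14, 0.80, 0.87]

def fit_power_curve(xs, ys):
    """Fit Y = a * X**b by least squares on the common logarithms."""
    n = len(xs)
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    sum_x, sum_y = sum(log_x), sum(log_y)
    sum_xx = sum(v * v for v in log_x)
    sum_xy = sum(u * v for u, v in zip(log_x, log_y))
    # Solve the two normal equations for log a and b.
    determinant = n * sum_xx - sum_x ** 2
    b = (n * sum_xy - sum_x * sum_y) / determinant
    log_a = (sum_y * sum_xx - sum_x * sum_xy) / determinant
    return log_a, b

log_a, b = fit_power_curve(production, price)
print(f"log a = {log_a:.5f}, a = {10 ** log_a:.4f}, b = {b:.5f}")
```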

1 I am indebted to Mr. H. B. Killough of the Bureau of Agricultural Economics
for permission to use the data presented in Tables 114 and 115. The figures are
taken from his comprehensive study of the factors affecting oat prices.




TABLE 115

Computation of Values Required in Fitting a Curve to Data of Oat
Production and Prices

Example I

(1)      (2)        (3)         (4)              (5)              (6)          (7)           (8)
         Ratio      Ratio of
Year     of price   production  log Y            log X            log² Y       log² X        log Y · log X
         to trend   to trend
         Y          X

1881     1.30        .929       .1139434         .9680157 − 1     .01298310    .001022995    −.0036444
1882     1.05       1.036       .0211893         .0153598         .00044899    .000235923     .0003255
1883      .90       1.156       .9542425 − 1     .0629578         .00209375    .003963685    −.0028808
1884      .85       1.128       .9294189 − 1     .0523091         .00498169    .002736242    −.0036920
1885      .84       1.165       .9242793 − 1     .0663259         .00573362    .004399125    −.0050222
1886      .77       1.108       .8864907 − 1     .0445398         .01288436    .001983794    −.0050557
1887      .96       1.124       .9822712 − 1     .0507663         .00031431    .002577217    −.0009000
1888      .79       1.151       .8976271 − 1     .0610753         .01048021    .003730192    −.0062524
1889      .81       1.188       .9084850 − 1     .0748164         .00837500    .005597494    −.0068468
1890     1.48        .798       .1702617         .9020029 − 1     .02898905    .009603432    −.0166852
1891     1.10       1.088       .0413927         .0366289         .00171336    .001341676     .0015162
1892     1.09        .943       .0374265         .9745117 − 1     .00140074    .000649653    −.0009539
1893     1.16        .882       .0644580         .9454686 − 1     .00415483    .002973674    −.0035150
1894     1.07        .886       .0293838         .9474337 − 1     .00086341    .002763216    −.0015446
1895      .75       1.070       .8750613 − 1     .0293838         .01560968    .000863408    −.0036712
1896      .76        .983       .8808136 − 1     .9925535 − 1     .01420540    .000055450     .0008875
1897      .96        .969       .9822712 − 1     .9863238 − 1     .00031431    .000187038     .0002425
1898      .95       1.005       .9777236 − 1     .0021661         .00049624    .000004692    −.0000483
1899      .83       1.074       .9190781 − 1     .0310043         .00654835    .000961267    −.0025089
1900      .86       1.033       .9344985 − 1     .0141003         .00429045    .000198818    −.0009236
1901     1.37        .857       .1367206         .9329808 − 1     .01725316    .004491573    −.0091629
1902     1.03       1.131       .0128372         .0534626         .00016479    .002858250     .0006863
1903     1.14        .911       .0569049         .9595184 − 1     .00323817    .001638760    −.0023036
1904      .86       1.033       .9344985 − 1     .0141003         .00429045    .000198818    −.0009236
1905      .86       1.090       .9344985 − 1     .0374265         .00429045    .001400743    −.0024515
1906     1.04       1.013       .0170333         .0056094         .00029013    .000031465     .0000955
1907     1.31        .770       .1172713         .8864907 − 1     .01375256    .012884361    −.0133113
1908     1.29        .796       .1105897         .9009131 − 1     .01223008    .009818214    −.0109580
1909     1.03        .978       .0128372         .9903389 − 1     .00016479    .000093337    −.0001240
1910      .81       1.064       .9084850 − 1     .0269416         .00837500    .000725850    −.0024656
1911     1.14        .810       .0569049         .9084850 − 1     .00323817    .008374812    −.0052076
1912      .80       1.221       .9030900 − 1     .0867157         .00939155    .007519613    −.0084036
1913      .87        .948       .9395193 − 1     .9768083 − 1     .00365792    .000537855     .0014027

Totals   32.83      33.338      17.6715068 − 18  14.0375350 − 14  .21721807    .096422642    −.1194567
                                                                                             +.0051562
                                                                                             −.1143005


THE STANDARD ERROR OF ESTIMATE IN
LOGARITHMIC TERMS
How reliable is this equation? With what degree of 


confidence may estimates be based upon it? To answer 
these questions we must compute the standard error, S. 




Since the fitting process was carried through in terms of 
logarithms, the standard error may be computed in the 
same terms. Following the procedure explained in earlier 
sections with reference to the straight line and the potential 
series, we may derive the following equation relating to the 
logarithmic curve just fitted: 


$$S_{\log y}^2 = \frac{\Sigma(\log^2 Y) - \log a\,\Sigma(\log Y) - b\,\Sigma(\log X \cdot \log Y)}{N}$$

Substituting the proper values, we have

$$S_{\log y}^2 = \frac{.21721807 - (-.00861 \times -.32849) - (-1.18206 \times -.1143005)}{33} = \frac{.07927928}{33}$$

$$S_{\log y}^2 = .0024024$$


The standard error of estimate, in the form of a loga- 
rithm, is .04901. As long as we deal with logarithms, 
this is to be interpreted precisely as is the standard error 
with respect to other curves. Assuming a normal distri- 
bution of logarithms about the curve which describes the 
average relationship, the chances are 68 out of 100 that 
the logarithm of a given estimate will not differ from the 
logarithm of the actual value by more than .04901, 95 out 
of 100 that the logarithm of the given estimate will not 
differ from the logarithm of the actual value by more than 
.09802, and 99.7 out of 100 that the logarithm of the 
given estimate will not differ from the logarithm of the 
actual value by more than .14703. 
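
The computation of the standard error in logarithmic terms may be sketched as follows. The Python lines below are an added illustration which assumes the sums and constants quoted in the text; the function name is hypothetical.

```python
import math

def log_standard_error(sum_log2_y, sum_log_y, sum_logx_logy, log_a, b, n):
    """Standard error of estimate, in logarithms, about the fitted curve."""
    s_squared = (sum_log2_y - log_a * sum_log_y - b * sum_logx_logy) / n
    return math.sqrt(s_squared)

# Sums from Table 115 and constants of the fitted equation.
s_log = log_standard_error(sum_log2_y=0.21721807, sum_log_y=-0.32849,
                           sum_logx_logy=-0.1143005, log_a=-0.00861,
                           b=-1.18206, n=33)

# Limits, in logarithms, for one, two and three standard errors.
log_limits = {k: k * s_log for k in (1, 2, 3)}
print(f"S_log_y = {s_log:.5f}", {k: round(v, 5) for k, v in log_limits.items()})
```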


INTERPRETATION OF THE STANDARD ERROR OF ESTIMATE;
ZONES OF ESTIMATE 


What does this mean in terms of actual values? It 
means, simply, that we are dealing throughout in terms 
of ratios instead of absolute figures. The difference between 




the logarithms of two numbers is the logarithm of the 
ratio of one of the original numbers to the other. Thus 
the absolute value of S in a given case will depend upon 
the magnitude of the values with which we are dealing. 
If the user desires to reduce S to absolute values, it must 
be done always with reference to a given estimate. That 
is, a given value of X is substituted in the equation of 
average relationship and the corresponding value of Y 
estimated. If the logarithmic equation is used, this estimate
will be in the form of a logarithm. To the logarithm of
the estimate add the value of $S_{\log y}$. The anti-logarithm of
the number thus secured will give the upper limit of a zone
extending a distance equal to $S$ above the line of regression.
From the logarithm of the estimate subtract the value
of $S_{\log y}$. The anti-logarithm of the number thus secured
will give the lower limit of a zone extending a distance
equal to $S$ below the line of regression. The odds are 68
out of 100 that the value of Y in the given case will fall 
within the limits thus marked out. The absolute limits 
corresponding to 2S and 3S may be similarly determined. 
The zone thus marked out with respect to a logarithmic
curve will differ materially from the similar zones already
described in dealing with simple linear equations. In the
simple case a zone extending $1S$ on each side of the estimating
curve has the same absolute width throughout
its length, and is centered always at the line of regression.
The logarithmic zone, when measured in natural numbers,
is of varying width, and, moreover, is not of the same
width on each side of the plotted curve. It is true, however,
that the ratios on the two sides of the curve are
always equal. That is, the ratio of a value $1S$ less than the
computed value to the computed value is the same as the
ratio of the latter to a value $1S$ greater. And when the
curves are plotted on paper ruled logarithmically, the zone
included within a distance $1S$ on each side of the plotted
curve takes the symmetrical form found in the earlier and




simpler cases. A person accustomed to thinking in terms of 
ratios and to the use of logarithmic paper can readily 
interpret this measure. 


THE STANDARD ERROR OF ESTIMATE IN TERMS
OF RATIOS


Since the ratios are equal throughout, the standard error 
of estimate may be expressed in ratio terms. In the present 
example we have 


$$S_r = \text{anti-log } S_{\log y} = \text{anti-log } .04901 = 1.12$$

where $S_r$ is used to represent the standard error of estimate
in terms of ratios. $S_{\log y}$, as derived above, is positive,
hence the ratio exceeds unity. It is the ratio of the larger
number to the smaller. What does it mean? It means
that in 68 cases out of 100 the actual value, if it exceed
the estimate, will not exceed it by more than 12 per cent,
and, if it fall below the estimate, will stay within a limit
such that the estimate will not be more than 12 per cent
greater than the actual value. This is not a convenient
form, since this ratio always expresses the larger value in
terms of the smaller value. It would be more convenient
to have it always in terms of a percentage of the estimate.
This may be done by putting $S_{\log y}$ in negative terms, and
getting the corresponding natural value. The value
− .04901 = 9.95099 − 10, which is the logarithm of .8933.
In this form the ratio is based upon the relation of the
smaller to the larger number. To make $S_r$ readily intelligible
we may combine the two, writing

$$S_r = .89 \text{ to } 1.12$$


Interpreting this, it means that, given a normal distribution, 
in 68 cases out of 100 the actual value will not be less than 
89 per cent of the estimate, or more than 112 per cent of 
the estimate. This has a simple, definite meaning more 




significant for most practical purposes than a similar 
measure in terms of absolute values.! 

To find the values of 2S or 3S these percentage figures
may not be simply multiplied by 2 or 3. The value of
$S_{\log y}$ must be so multiplied, and the resulting values
reduced to natural numbers. For convenience in use, the
anti-logarithms of both the positive and negative values
should be secured, as in the preceding case. The
computations are simple.

$$2S_{\log y} = .09802$$

The anti-logarithm of this value, when considered positive,
is 1.25, when negative .80.

$$3S_{\log y} = .14703$$

The corresponding anti-logarithms are 1.40 and .71.
Summarizing for the standard error, we have

$S_r$ = .89 to 1.12
$2S_r$ = .80 to 1.25
$3S_r$ = .71 to 1.40


The values given for $S_r$ indicate the probable percentage
limits within which actual value and estimated value
should fall in 68 out of 100 cases. The values given for
$2S_r$ indicate the probable percentage limits in 95 out of
100 cases. The values of $3S_r$ indicate the probable
percentage limits in 99.7 cases out of 100, always on the
assumption of a normal distribution of the logarithms of the
actual values about the fitted curve.
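
The reduction of a logarithmic standard error to ratio limits may be sketched in a few lines of Python. The helper name `ratio_limits` is an assumption made for illustration and does not appear in the text; only the value .04901 is taken from the example above. The point illustrated is that the multiple is applied to the logarithm before the anti-logarithm is taken.

```python
# Sketch: convert a standard error in common logarithms to ratio limits.
# `ratio_limits` is a hypothetical helper name; 0.04901 is the value of
# S_log_y from the oats example in the text.

def ratio_limits(s_log: float, k: int = 1) -> tuple[float, float]:
    """Return (lower, upper) ratio limits for k standard errors.

    The multiple k is applied to the logarithm first; only then is the
    anti-logarithm taken, since the ratios themselves may not simply be
    multiplied by 2 or 3.
    """
    upper = 10 ** (k * s_log)    # anti-log of the positive value
    lower = 10 ** (-k * s_log)   # anti-log of the negative value
    return lower, upper


S_LOG_Y = 0.04901
for k in (1, 2, 3):
    lower, upper = ratio_limits(S_LOG_Y, k)
    print(f"{k} S_r: {lower:.2f} to {upper:.2f}")
```

Rounded to two places, the three pairs of limits agree with the summary given above.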


APPLICATION OF THE STANDARD ERROR OF ESTIMATE


We may illustrate the use of $S_{\log y}$. Given a production
of oats 50 per cent above the trend value (i.e. the ratio to 


1 The significance of a measure of reliability in percentage form was pointed 
out by D. H. Davenport in 1922, in an unpublished article, and such a measure has 
been employed in several studies. There has not been available, however, a ready 
method of computing this measure, and its possibilities have not, therefore, been 
fully realized. 


462 THE MEASUREMENT OF RELATIONSHIP 


trend is 1.50), what is the most probable accompanying 
price ratio and what is the degree of accuracy of this 
estimate? 

The estimating equation is 


log Y = (9.99139 — 10) — 1.18206 log X. 


Substituting in this equation the value .176091 (the 
logarithm of 1.50) we secure for log Y the value 9.78324 
—10. The corresponding natural number is .607. This 
means that if production is 150 per cent of normal (as 
measured by the given line of trend) price will probably be 
60.7 per cent of normal (as measured by the line of trend). 

To determine the reliability of this estimate, the standard
error must be secured. Employing the values of $S_r$ already
computed we find that 54 is 89 per cent of 60.7, while
68 is 112 per cent of 60.7. We interpret these figures to
mean that in 68 cases out of 100 the actual price prevailing
under the given production conditions will not be less
than 54 per cent of the normal or trend value nor more
than 68 per cent of normal.¹ Corresponding values for
$2S_r$ and $3S_r$ may be determined in the manner outlined
above.
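
The estimate and its limits for a production ratio of 1.50 may be checked with the following sketch. The constants are those of the estimating equation and standard error given in the text; the function names `estimate_price_ratio` and `limits` are assumptions introduced only for this illustration.

```python
import math

# Constants of the fitted equation log Y = log a + b log X (from the text).
LOG_A = -0.00861      # i.e. 9.99139 - 10
B = -1.18206
S_LOG_Y = 0.04901     # standard error of estimate, in logarithms


def estimate_price_ratio(x: float) -> float:
    """Most probable price ratio for a production ratio x."""
    return 10 ** (LOG_A + B * math.log10(x))


def limits(x: float, k: int = 1) -> tuple[float, float]:
    """Lower and upper limits of the estimate for k standard errors."""
    y = estimate_price_ratio(x)
    return y * 10 ** (-k * S_LOG_Y), y * 10 ** (k * S_LOG_Y)


y_hat = estimate_price_ratio(1.50)
low, high = limits(1.50)
print(f"estimate {y_hat:.3f}, limits {low:.2f} to {high:.2f}")
```

Calling `limits(1.50, 2)` or `limits(1.50, 3)` gives the corresponding limits for two and three standard errors.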


THE INDEX OF CORRELATION BASED ON LOGARITHMIC
VALUES


We have still to compute the third measure, the abstract
index of correlation.² For an equation of the type

$$\log Y = \log a + b \log X$$

the formula for $\rho$ reduces to


1 A question arises at once as to the adequacy of the given lines of trend in 
thus determining the reliability of estimates in the present problem. This question 
is discussed in greater detail in another section. 

2 The symbol $\rho$ is used for this measure of correlation, instead of r, even though
the relationship in logarithmic form is linear. This is done because such a measure, 
in terms of logarithms, cannot be interpreted in precisely the same way as the 
ordinary coefficient of correlation. 




$$\rho^2_{\log y \cdot \log x} = \frac{\log a\,\Sigma(\log Y) + b\,\Sigma(\log X \cdot \log Y) - N c^2_{\log y}}{\Sigma(\log^2 Y) - N c^2_{\log y}}$$

where $c_{\log y}$ represents the difference between the
arithmetic mean of the logarithms of the Y-values and the
origin (in this case, zero on the logarithmic scale).
Substituting the proper values we have


$$\rho^2_{\log y \cdot \log x} = \frac{(-.00861 \times -.32849) + (-1.18206 \times -.1143005) - (33 \times .00009909)}{.21721807 - (33 \times .00009909)} = \frac{.13466882}{.2139481} = .629445$$

$$\rho_{\log y \cdot \log x} = .793$$

The index of correlation has a value of .793. How is 

this to be interpreted when we are dealing with logarithms 
as in the present case? 

Its significance may be clearer if viewed in terms of the
relationship

$$\rho^2_{\log y \cdot \log x} = 1 - \frac{S^2_{\log y}}{\sigma^2_{\log y}}$$


In the present case these values are 


$$S_{\log y} = .04901$$
$$\sigma_{\log y} = .08052$$


When these values are squared and inserted in the above
formula we have

$$\rho^2_{\log y \cdot \log x} = 1 - \frac{.002402417}{.00648328}$$

and

$$\rho_{\log y \cdot \log x} = .793$$
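
The second route to the index, from the two measures of variability, may be sketched as follows. The function name `index_of_correlation` is an assumption for illustration; the two input values are those given in the text.

```python
import math

S_LOG_Y = 0.04901      # scatter about the fitted curve, in logarithms
SIGMA_LOG_Y = 0.08052  # scatter about the mean of the logarithms


def index_of_correlation(s: float, sigma: float) -> float:
    """Index of correlation from rho^2 = 1 - S^2 / sigma^2."""
    return math.sqrt(1.0 - (s / sigma) ** 2)


rho = index_of_correlation(S_LOG_Y, SIGMA_LOG_Y)
print(f"rho = {rho:.3f}")
```

The same function serves for any family of curves, provided S and sigma are expressed in the same terms (natural numbers, logarithms, or reciprocals).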
What does this value measure? We have seen that r
and the more general index $\rho$ are abstract measures of the
degree of relationship between two variables, as this
relationship is described by given functions. The value of
$\rho$ in a given case depends upon the variability about the
fitted line, in relation to the variability about the mean
of the Y’s. If the variability of estimates is materially
reduced when the equation of regression is used as a basis
for estimates, instead of the mean Y, the equation may be
assumed to describe a significant relationship. The value
of $\rho$ depends thus upon the relation between the two
quantities, $S_y$ and $\sigma_y$.

In the cases dealt with in the preceding chapter the 
variability in each case was measured in terms of absolute 
deviations, and the value of $\rho$ depended upon the relation
between the two given measures of absolute variability. 
The sole difference in the present case is that we are working 
in terms of logarithmic or ratio variability, deviations being 
measured in terms of logarithms instead of natural numbers. 

The index $\rho$ must be interpreted in the light of this fact.
Its value, as always, depends upon the relation between
two measures of variability, $S^2$ and $\sigma^2$, but in the present
instance these are expressed in terms of logarithms. In
brief, the value of $\rho$ depends upon the relation between
the ratio variability about the fitted curve and the ratio
variability about the geometric mean of the Y’s. (It is
the geometric mean of the Y’s, because that is the value
corresponding to the arithmetic mean of the Y logarithms.)

We have here a set of measures, therefore, which perform
in the field of ratios precisely the same service as is
performed in the field of natural numbers by S and $\rho$ (in the
linear case, r). These measures are secured in the same
way as are S and $\rho$, except that the equation of
relationship from which they are derived is one in which the
dependent variable is log Y (or, in the reverse case, log X).
The general formulas for computing these values are the 
same as in dealing with natural numbers, except that 
log Y replaces Y throughout. The operation is analogous 
to that of using logarithmic paper instead of natural scale 
paper. 

It should be noted that the values are in logarithmic or
ratio form if Y is expressed logarithmically, whether X
be so expressed or not. Thus we have fitted a curve of
the type

$$\log Y = \log a + b \log X$$

the logarithmic form of the ordinary parabola or hyperbola.
The values S and $\rho$ would also be in logarithmic form if
the curve were of the type

$$\log Y = \log a + X \log b$$

the logarithmic form of the exponential

$$Y = ab^X$$

In each of these cases the logarithmic equation is linear,
but this is not essential to the use of these measures. S
and $\rho$ are generally applicable measures, whether ratios or
natural numbers be dealt with, and whether the functions
be linear or otherwise.

It may be well at this point to summarize the symbols 
that have been used and to distinguish the different 
measures. We may employ the symbols $S_y$, $\sigma_y$ and $\rho$
when arithmetic relations are in question, the two former
being measures of variation in absolute terms, and the
index $\rho$ referring to degree of relationship when natural
numbers are employed. If the logarithms of the Y’s are
used it is advisable to distinguish the symbols by
subscripts, using $S_{\log y}$ and $\sigma_{\log y}$ as measures of the logarithmic
variation about the fitted curve and about the arithmetic
mean of the logarithms of the Y’s, respectively. If $S_{\log y}$
is reduced to ratio form, it may be written $S_r$. Since the
index $\rho$ must be interpreted somewhat differently in this
case, it may be written $\rho_{\log y \cdot \log x}$ or $\rho_{\log y \cdot x}$.


THE USE OF RECIPROCALS IN THE MEASUREMENT OF
RELATIONSHIP


Another type of curve may be used to describe the 
relationship between the production and price of oats, and 




its use introduces us to a third field of correlation, a field 
in which somewhat new concepts enter, and in which the 
various measures must be interpreted in still another way. 
This is a curve of the type 


$$Y = \frac{1}{a + bX}$$

which may be expanded by adding additional terms to
the denominator, as

$$Y = \frac{1}{a + bX + cX^2}$$


This hyperbolic form has been used in several studies as
an approximation to a “demand” curve for various
commodities.

The equation to a curve of this type may be written

$$\frac{1}{Y} = a + bX$$

which is the equation to a straight line describing the 
relationship between the reciprocals of the Y’s and the 
original X values. The normal equations required in 
fitting a curve of this type are 


$$\Sigma\left(\frac{1}{Y}\right) = Na + b\,\Sigma(X)$$
$$\Sigma\left(\frac{X}{Y}\right) = a\,\Sigma(X) + b\,\Sigma(X^2)$$


The method of computing the necessary values is illustrated 
in Table 116. 
Substituting the proper values in the normal equations, 
we have 
34.3360320 = 33a + 33.338b
35.2571485 = 33.338a + 34.168554b
Solving,
a = -.1357
b = 1.1643
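
The solution of the two normal equations may be verified with a short sketch. The sums are the totals of Table 116; the text does not say how the equations were solved, so the use of Cramer's rule here, and the name `solve_normal_equations`, are assumptions made for illustration.

```python
# Totals from Table 116 (N = 33 years).
N = 33
SUM_X = 33.338            # sum of X
SUM_X2 = 34.168554        # sum of X^2
SUM_RECIP_Y = 34.3360320  # sum of 1/Y
SUM_X_OVER_Y = 35.2571485 # sum of X/Y


def solve_normal_equations(n, sx, sxx, sz, sxz):
    """Solve  sz = n*a + sx*b  and  sxz = sx*a + sxx*b  for a and b.

    Cramer's rule is used here as one convenient method; any method of
    solving two simultaneous linear equations gives the same result.
    """
    det = n * sxx - sx * sx
    a = (sz * sxx - sx * sxz) / det
    b = (n * sxz - sx * sz) / det
    return a, b


a, b = solve_normal_equations(N, SUM_X, SUM_X2, SUM_RECIP_Y, SUM_X_OVER_Y)
print(f"a = {a:.4f}, b = {b:.4f}")
```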


TABLE 116

Computation of Values Required in Fitting a Curve to Data of Oat
Production and Prices

Example II

 (1)     (2)     (3)        (4)          (5)          (6)          (7)
 Year   Price  Produc-      1/Y          X/Y         1/Y^2         X^2
        ratio   tion
          Y     ratio
                  X

 1881   1.30    .929     .7692308     .7146154     .59171602     .863041
 1882   1.05   1.036     .9523810     .9866667     .90702957    1.073296
 1883    .90   1.156    1.1111111    1.2844444    1.23456788    1.336336
 1884    .85   1.128    1.1764706    1.3270588    1.38408307    1.272384
 1885    .84   1.165    1.1904762    1.3869048    1.41723358    1.357225
 1886    .77   1.108    1.2987013    1.4389610    1.68662507    1.227664
 1887    .96   1.124    1.0416667    1.1708334    1.08506951    1.263376
 1888    .79   1.151    1.2658228    1.4569620    1.60230736    1.324801
 1889    .81   1.188    1.2345679    1.4666667    1.52415790    1.411344
 1890   1.48    .798     .6756757     .5391892     .45653765     .636804
 1891   1.10   1.088     .9090909     .9890909     .82644626    1.183744
 1892   1.09    .943     .9174312     .8651376     .84168001     .889249
 1893   1.16    .882     .8620690     .7603449     .74316296     .777924
 1894   1.07    .886     .9345794     .8280373     .87343865     .784996
 1895    .75   1.070    1.3333333    1.4266666    1.77777769    1.144900
 1896    .76    .983    1.3157895    1.2934211    1.73130201     .966289
 1897    .96    .969    1.0416667    1.0093750    1.08506951     .938961
 1898    .95   1.005    1.0526316    1.0578948    1.10803329    1.010025
 1899    .83   1.074    1.2048193    1.2939759    1.45158955    1.153476
 1900    .86   1.033    1.1627907    1.2011628    1.35208221    1.067089
 1901   1.37    .857     .7299270     .6255480     .53279343     .734449
 1902   1.03   1.131     .9708738    1.0980583     .94259594    1.279161
 1903   1.14    .911     .8771930     .7991228     .76946756     .829921
 1904    .86   1.033    1.1627907    1.2011628    1.35208221    1.067089
 1905    .86   1.090    1.1627907    1.2674419    1.35208221    1.188100
 1906   1.04   1.013     .9615385     .9740385     .92455629    1.026169
 1907   1.31    .770     .7633588     .5877863     .58271666     .592900
 1908   1.29    .796     .7751938     .6170543     .60092543     .633616
 1909   1.03    .978     .9708738     .9495146     .94259594     .956484
 1910    .81   1.064    1.2345679    1.3135802    1.52415790    1.132096
 1911   1.14    .810     .8771930     .7105263     .76946756     .656100
 1912    .80   1.221    1.2500000    1.5262500    1.56250000    1.490841
 1913    .87    .948    1.1494253    1.0896552    1.32117852     .898704

 Total 32.83  33.338   34.3360320   35.2571485   36.85702940   34.168554


The desired equation is, therefore, 


$$\frac{1}{Y} = -.1357 + 1.1643X$$




THE STANDARD ERROR AND THE INDEX OF CORRELATION
IN TERMS OF RECIPROCALS


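To determine the utility of this equation we must have
the standard error and the index of correlation. The two
necessary formulas may be derived as in the preceding
cases. Representing by $1/Y$ the reciprocal of an actual
value we have, for each residual,

$$d = a + bX - \frac{1}{Y}$$

Multiplying by d and summing

$$\Sigma(d^2) = a\,\Sigma(d) + b\,\Sigma(dX) - \Sigma\left(\frac{d}{Y}\right)$$

Since

$$\Sigma(d) = 0 \quad\text{and}\quad \Sigma(dX) = 0,$$

we have

$$\Sigma(d^2) = -\Sigma\left(\frac{d}{Y}\right)$$

Multiplying the residual equation now by $1/Y$, and
summing, we have

$$\Sigma\left(\frac{d}{Y}\right) = a\,\Sigma\left(\frac{1}{Y}\right) + b\,\Sigma\left(\frac{X}{Y}\right) - \Sigma\left(\frac{1}{Y^2}\right)$$

Substituting the equivalent of $\Sigma(d/Y)$ in the preceding
equation, we secure

$$\Sigma(d^2) = \Sigma\left(\frac{1}{Y^2}\right) - a\,\Sigma\left(\frac{1}{Y}\right) - b\,\Sigma\left(\frac{X}{Y}\right)$$

and for $S^2_{1/y}$, we have

$$S^2_{1/y} = \frac{\Sigma\left(\frac{1}{Y^2}\right) - a\,\Sigma\left(\frac{1}{Y}\right) - b\,\Sigma\left(\frac{X}{Y}\right)}{N}$$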




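Inserting this value of $S^2_{1/y}$ in the general formula for the
index of correlation, we have

$$\rho^2_{1/y} = \frac{a\,\Sigma\left(\frac{1}{Y}\right) + b\,\Sigma\left(\frac{X}{Y}\right) - N c^2_{1/y}}{\Sigma\left(\frac{1}{Y^2}\right) - N c^2_{1/y}}$$

where $c_{1/y}$ is the arithmetic mean of the reciprocals of the
Y-values, measured from zero.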
Inserting the proper values in these two equations, we find 
that 


$$S_{1/y} = .1191$$

$$\rho_{1/y} = .766$$


For the standard deviation of the original Y-values, in 
terms of reciprocals, we secure 


$$\sigma_{1/y} = .1851$$

(The subscript $1/y$ is used in connection with each of these
measures, as they should be distinguished from measures
based upon natural numbers or logarithms.)
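
The three measures in terms of reciprocals may be computed from the totals of Table 116, as in the sketch below. The variable names are assumptions chosen for illustration. The constants a and b are carried unrounded from the normal equations, because the sum of squared residuals is a small difference between large sums and is sensitive to rounding in a and b.

```python
import math

# Totals from Table 116, writing z for the reciprocal 1/Y.
N = 33
SUM_X, SUM_X2 = 33.338, 34.168554
SUM_Z, SUM_XZ, SUM_Z2 = 34.3360320, 35.2571485, 36.85702940

# Unrounded constants of 1/Y = a + bX from the normal equations.
det = N * SUM_X2 - SUM_X ** 2
a = (SUM_Z * SUM_X2 - SUM_X * SUM_XZ) / det
b = (N * SUM_XZ - SUM_X * SUM_Z) / det

# Sum of squared residuals: sum(1/Y^2) - a*sum(1/Y) - b*sum(X/Y).
sum_d2 = SUM_Z2 - a * SUM_Z - b * SUM_XZ
s_recip = math.sqrt(sum_d2 / N)            # standard error of estimate

c = SUM_Z / N                              # arithmetic mean of reciprocals
sigma_recip = math.sqrt(SUM_Z2 / N - c ** 2)

# Index of correlation by the formula in terms of sums.
rho_recip = math.sqrt(
    (a * SUM_Z + b * SUM_XZ - N * c ** 2) / (SUM_Z2 - N * c ** 2)
)

print(f"S = {s_recip:.4f}, sigma = {sigma_recip:.4f}, rho = {rho_recip:.3f}")
```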


INTERPRETATION OF THE STANDARD ERROR OF 
ESTIMATE 


How may we interpret these results? As in all former
problems of this type the equation gives us a means of
estimating Y from a known value of X. The standard
error $S_{1/y}$ serves as a measure of the reliability of such
estimates, and $\rho_{1/y}$ is an abstract measure of the degree of
relationship between the two variables. But in the present
case all these measures are in terms of reciprocals. The
equation enables us to estimate the reciprocal of Y, the
standard error has significance only in the form of a
reciprocal, and the value of $\rho$ depends upon the relation
between two measures ($S^2_{1/y}$ and $\sigma^2_{1/y}$) both of which are in
terms of reciprocals.

An illustration may make these meanings clear. If, in a 
given year, the production of oats is 150 per cent of normal, 
what is the most probable price? Substituting in the 
equation 


$$\frac{1}{Y} = -.1357 + 1.1643X$$

a value of 1.50 for X, we have

$$\frac{1}{Y} = 1.6108$$

and

$$Y = .621$$


We may expect a price approximately 62 per cent of 
normal. As a measure of the reliability of this estimate, 


we have 


$$S_{1/y} = .1191$$
This must be applied to the estimate in terms of reciprocals. 


Thus we have 
1.6108 + .1191 = 1.7299 
1.6108 — .1191 = 1.4917 


Reducing these reciprocals to natural numbers we secure
.578 and .670 as the desired values. The most probable
price, then, is 62.1 per cent of normal, and, on the
assumption of an approximately normal distribution of reciprocals
about the curve, the odds are 68 out of 100 that the price
will fall between 57.8 per cent of normal and 67.0 per cent
of normal. The limits of 2S and 3S may be similarly
determined by adding to and subtracting from the estimate,
as a reciprocal, amounts equal to twice .1191 and three
times .1191. The results secured may then be converted
to natural numbers. Just as with logarithms, the value
in absolute terms of a given difference between reciprocals
varies at different points within the range of Y-values.
Accordingly, the limits of reliability determined from $S_{1/y}$
should be expressed in natural numbers only after a
particular estimate has been made.
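
The procedure just described, in which the standard error is applied to the reciprocal and the result then reduced to natural numbers, may be sketched as follows. The function name `reciprocal_limits` is an assumption for illustration; the constants are those of the text.

```python
A, B = -0.1357, 1.1643   # constants of 1/Y = a + bX, from the text
S_RECIP = 0.1191         # standard error of estimate, in reciprocals


def reciprocal_limits(x: float, k: int = 1) -> tuple[float, float, float]:
    """Return (lower, estimate, upper) for Y in natural numbers.

    The k standard errors are added to and subtracted from the estimated
    reciprocal; only afterwards are the results converted to natural
    numbers. The sketch assumes z - k*S stays positive.
    """
    z = A + B * x                       # estimated reciprocal of Y
    lower = 1.0 / (z + k * S_RECIP)     # larger reciprocal, smaller Y
    upper = 1.0 / (z - k * S_RECIP)     # smaller reciprocal, larger Y
    return lower, 1.0 / z, upper


low, y_hat, high = reciprocal_limits(1.50)
print(f"estimate {y_hat:.3f}, limits {low:.3f} to {high:.3f}")
```

The upper limit lies farther from the estimate than the lower, which is the unequal scatter that a harmonic dispersion implies.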


A COMPARISON OF MEASURES OF RELATIONSHIP
DERIVED FROM ARITHMETIC, GEOMETRIC
AND HARMONIC MEASURES


In interpreting $\rho$ similar considerations enter. The value
of the index of correlation, as we have seen, depends upon
the degree of variation about the curve, as compared with
the variation about the average of the original dependent
series. In handling natural numbers, variability about
the fitted line is compared with the variability about the
arithmetic mean of the dependent variable, both measured
in absolute terms (i.e. $S_y$ is compared with $\sigma_y$). In handling
logarithms, variability about the fitted line is compared
with variability about the arithmetic mean of the
logarithms of the dependent series, variability being measured
in each case in terms of logarithms. But logarithmic
deviations, as we have seen, may be interpreted in terms
of ratios. The logarithmic deviations from the line
represent the ratios of actual values to computed, while
logarithmic deviations about the arithmetic mean of the
logarithms of the original series represent the ratios of the
actual values of the dependent series to their geometric
mean. The value of $\rho_{\log y}$ depends upon the relation between
these respective deviations (i.e. $S_{\log y}$ is compared with $\sigma_{\log y}$).

In fitting a curve in which the reciprocals of the
dependent variable are employed, variability about the fitted
line is measured in terms of reciprocals, and the variability
of the original series is measured in the same terms. That
is, $\sigma_{1/y}$ is computed from the differences between the
reciprocals of the actual values and the arithmetic mean of all
these reciprocals. But the arithmetic mean of these
reciprocals is the reciprocal of the harmonic mean. Thus,
in short, the value of the index of correlation, $\rho_{1/y}$, depends
upon the relation between variability about the fitted line
and variability about the harmonic mean of the dependent
series, variation in both cases being measured in terms of
reciprocals (i.e. $S_{1/y}$ is compared with $\sigma_{1/y}$).


We have, therefore, three broad families of curves for
describing the relationship between variable quantities.
These are:


1. Curves in the fitting of which natural values of the depen- 
dent variable are employed. Equations to all curves of 
this family will be of the type 

Y = f(X) 


2. Curves in the fitting of which logarithms of the dependent 
variable are employed. In all such cases the equations 
will be of the type 

log Y = f(X) 


3. Curves in the fitting of which reciprocals of the dependent 
variable are employed. For these curves the equations 
will be of the type 


1/Y = f(X)


In any one of these three cases the equations may be
linear or non-linear. In so far as this problem of
interpretation is concerned, there is no limitation as to the function
of X which may be employed. (The computation of S
and $\rho$ by the methods suggested above involves certain
limitations which are outlined elsewhere.)

The standard error of estimate for the first family of 
curves is derived in terms of the original units of measure- 
ment (for the dependent variable) and has a direct and 
simple meaning in these terms. The index of correlation, 




for curves of this type, is a measure of the degree to which 
the absolute variability of the dependent variable may be 
lessened by measuring deviations from the fitted curve 
instead of from the arithmetic mean. 

The standard error of estimate for the second family of 
curves is derived, by the method outlined, in terms of 
logarithms. It is more convenient in general to give it 
meaning in terms of ratios. The index of correlation, $\rho_{\log y}$,
is a measure of the degree to which the logarithmic or ratio 
variability of the dependent variable may be lessened by 
computing deviations (or ratios) with the fitted curve 
instead of the geometric mean as base. 

The standard error of estimate for the third family of 
curves is derived by the same process as in the other cases, 
but emerges as a reciprocal. The index of correlation, $\rho_{1/y}$, is
a measure of the degree to which the variability of the
dependent variable, in terms of reciprocals, may be lessened 
by computing reciprocal deviations from the fitted curve 
instead of from the harmonic mean. 
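
The correspondence between the three families and the three averages may be illustrated numerically. In the sketch below the four price ratios are hypothetical values chosen only for illustration; the functions `fmean`, `geometric_mean` and `harmonic_mean` are those of the Python standard library `statistics` module.

```python
import math
from statistics import fmean, geometric_mean, harmonic_mean

prices = [0.80, 1.00, 1.25, 1.48]   # hypothetical price ratios

# Arithmetic mean of the logarithms, and of the reciprocals.
mean_of_logs = fmean([math.log10(p) for p in prices])
mean_of_recips = fmean([1.0 / p for p in prices])

# Reduced to natural numbers, these are the geometric and harmonic means.
geometric_from_logs = 10 ** mean_of_logs
harmonic_from_recips = 1.0 / mean_of_recips

print(f"arithmetic mean {fmean(prices):.4f}")
print(f"geometric mean  {geometric_from_logs:.4f}")
print(f"harmonic mean   {harmonic_from_recips:.4f}")
```

Deviations measured in logarithms are thus deviations about the geometric mean, and deviations measured in reciprocals are deviations about the harmonic mean.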


FACTORS GOVERNING THE CHOICE OF MEASURES OF
RELATIONSHIP


It is clear, therefore, that the choice of a type of curve 
to describe a given relationship must be governed by basic 
considerations as to the type of average which is most 
appropriate as a measure of the central tendency of the 
given series. And this brings in a related question as to 
whether the dispersion about this average more nearly 
approximates the normal type when measured in absolute 
terms, in logarithms, or in reciprocals. In selecting a 
curve and in using the measures S and $\rho$ there is always
present an implicit assumption with respect to these points. 

When absolute values are important, and the dispersion 
of the dependent variable approaches the normal type when 
plotted on an arithmetic scale, measures of relationship of 


474 THE MEASUREMENT OF RELATIONSHIP 


the arithmetic type would appear to be appropriate. . But, 
as we have seen, in handling series in which rates of change 
rather than absolute amounts of change are of primary im- 
portance and the dispersion appears to follow a geometric 
law, the arithmetic mean and other arithmetic measures are 
notoriously inadequate. In such cases logarithmic curves 
seem preferable to arithmetic, and measures of the reli- 
ability of estimates and of degree of relationship which are 
based upon ratios seem to be more suitable than those 
based upon absolute values. 

The harmonic mean has not been so widely employed as 
either of the above averages, and some attention may be 
given to principles governing its use in problems of the 
type here considered. In general, such harmonic measures 
are marked by the same weaknesses as the arithmetic, 
except that they err in the opposite direction. Geometric 
measures are perhaps better adapted to all-around em- 
ployment than either. Yet in one particular field of interest 
to the economist the harmonic mean is particularly ap- 
propriate, and the utilization of reciprocals, as in the 
preceding example, seems to be justified. 

The use of the harmonic mean assumes a normal distri- 
bution of reciprocals which, in natural numbers, means a 
much wider scatter above the average than below. The 
use of a curve of the type 


$$Y = \frac{1}{a + bX}$$


involves a similar assumption as to the relation between 
Y and X. A given absolute increase in X will be accom- 
panied by a certain decrease in the value of Y. The same 
absolute decrease in X will be accompanied by an increase 
in the value of Y which is larger than the decrease registered 
in the preceding case. But this is the relation which pre- 
vails, for many commodities, between the amounts pro- 
duced and the price, the latter considered dependent. A 
given increase in production will cause some lowering of
price. An equal decrease will cause a much greater increase
in price. Moreover, when averaging the prices of such
commodities over a period, the harmonic mean may give
a more typical value than any other average.¹ In such
cases there is a strong a priori justification for using a 
curve of the reciprocal type and measuring the accuracy 
of all estimates in terms of harmonic relations. 
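
The unequal response of price to equal increases and decreases in production may be seen by evaluating the fitted reciprocal curve at three points. The constants are those of the oats example; the function name `price` and the step of .2 in the production ratio are assumptions chosen for illustration.

```python
A, B = -0.1357, 1.1643   # constants of the fitted curve 1/Y = a + bX


def price(x: float) -> float:
    """Estimated price ratio for a production ratio x."""
    return 1.0 / (A + B * x)


base = price(1.0)
fall = base - price(1.2)     # lowering of price when production rises by .2
rise = price(0.8) - base     # increase of price when production falls by .2

print(f"price at normal production {base:.3f}")
print(f"fall for an increase of .2 in production {fall:.3f}")
print(f"rise for a decrease of .2 in production  {rise:.3f}")
```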


A COMPARISON OF ARITHMETIC, GEOMETRIC AND HARMONIC
MEASURES


The contrast between these different methods may be 
brought home most effectively by comparing the results 
obtained when curves of these three types are fitted to the 


1 “Buyers and sellers of potatoes are frequently mistaken as to the price 
justified by fundamental economic conditions. If such an error is general in the 
fall, it may happen, for example, that the price which results is too high. If the 
price is too high in the early part of the season, potatoes will not be consumed fast 
enough to dispose of the supply available. Farmers and dealers will then find 
that not all of the stocks on hand can be sold at existing prices. Since potatoes 
can not be carried over from one year to the next, the price, under such conditions 
as have been mentioned, must be lowered enough to permit the supply to be 
disposed of before the end of the season. A properly adjusted price would remain 
the same throughout the season, except for a gradual advance to cover cost of 
storage, and would maintain a fairly uniform consumption throughout the season. 
But since an abnormally high price early in the season causes small consumption, 
it must be compensated by an abnormally low price during the remainder of the 
season, or not all the crop can be sold. 

Similarly, if the price is abnormally low early in the season, the supply will be 
exhausted too rapidly and those who still have potatoes will find that they can get 
abnormally high prices for them during the remainder of the season.” 

But how, given the abnormally high or abnormally low prices during part of a 
season, may we compute the average price which would be justified by the true 
conditions of demand and supply, if these had been correctly estimated? Since 
“a low price during part of a season will be compensated only by a
disproportionately high price during the remainder of the season” the arithmetic average for
an entire season “will be somewhat higher than the average which would have 
resulted had a proper price been established at the beginning of the season. This 
difficulty is eliminated by taking the harmonic mean of the monthly prices.” 

Holbrook Working, Factors Determining the Price of Potatoes in St. Paul and 
Minneapolis. Technical Bulletin 10, University of Minnesota Agricultural Experi- 
ment Station, 8-10. 




same data. The computations involved in fitting curves
of the second and third types (logarithmic and reciprocal) 
have been illustrated with reference to the data of oat 
production and prices (Table 114). A straight line (arith- 
metic) is fitted to the same data, and the necessary accom- 
panying measures computed. The three sets of results are 
brought together in the following table: 


TABLE 117

Relation Between the Production and Price of Oats
1881-1913
Comparison of Results of Curve Fitting
(Prices are the dependent variable in each case)

     Equation                            Standard Error          Index of
                                         of Estimate             Correlation
A    Y = 2.24 - 1.236X                   $S_y$ = .12             r = -.795
B    1/Y = -.1357 + 1.1643X              $S_{1/y}$ = .1191       $\rho_{1/y}$ = .766
C    log Y = -.00861 - 1.18206 log X     $S_{\log y}$ = .04901   $\rho_{\log y}$ = .793


It is impossible to compare the three standard errors as 
they stand, since only the first one is in the original units 
of measurement (ratio of actual yield to normal). In the 
following table are given estimates, based on each of these 
equations, as to the most probable price (in terms of ratio 
to normal) which would accompany each of five different 
conditions of production. Each estimate is accompanied 
by a series of values which indicate the limits set by the
standard error. Throughout, the values of the estimates
plus and minus S, 2S, and 3S are given, in order to indicate
the probable scatter of actual values about the estimates.
The different amounts of variation which may be expected
about each of the three lines of relationship are measured
by the actual differences between the estimates and the
limiting cases. These differences are given in the columns
headed Δ. All values in this table are comparable, being
reduced to the original units (ratio of price to normal). 


TABLE 118

Comparison of Price Estimates and of Standard Errors of Estimate Based
on Three Equations Relating to the Production and Price of Oats

Columns: (1) value of X (ratio of production to normal); (2) estimated
value of Y (ratio of price to normal) from arithmetic equation (A);
(3) limits of arithmetic estimate; (4) Δ; (5) estimated value of Y from
reciprocal equation (B); (6) limits of reciprocal estimate; (7) Δ;
(8) estimated value of Y from logarithmic equation (C); (9) limits of
logarithmic estimate; (10) Δ.

 (1)    (2)        (3)         (4)     (5)     (6)     (7)     (8)     (9)     (10)

  .5   1.622   +3S = 1.982    +.36    2.240    ...     ...    2.224   3.114   +.890
               +2S = 1.862    +.24             ...     ...            2.780   +.556
                +S = 1.742    +.12             ...     ...            2.491   +.267
                -S = 1.502    -.12             ...     ...            1.979   -.245
               -2S = 1.382    -.24             ...     ...             ...     ...
               -3S = 1.262    -.36             ...     ...             ...     ...

  .8   1.251   +3S = 1.611    +.36    1.257    ...     ...    1.276    ...     ...
               +2S = 1.491    +.24             ...     ...             ...     ...
                +S = 1.371    +.12             ...     ...             ...     ...
                -S = 1.131    -.12             ...     ...             ...     ...
               -2S = 1.011    -.24             ...     ...             ...     ...
               -3S =  .891    -.36             ...     ...             ...     ...

 1.0   1.004   +3S = 1.364    +.36     .972   1.490    ...     .980    ...     ...
               +2S = 1.244    +.24            1.265    ...             ...     ...
                +S = 1.124    +.12            1.100    ...             ...     ...
                -S =  .884    -.12             .871    ...             ...     ...
               -2S =  .764    -.24             .789    ...             ...     ...
               -3S =  .644    -.36             .722    ...             ...     ...

 1.2    .757   +3S = 1.117    +.36     .793   1.106    ...     .790    ...     ...
               +2S =  .997    +.24             .977    ...             ...     ...
                +S =  .877    +.12             .875    ...             ...     ...
                -S =  .637    -.12             .724    ...             ...     ...
               -2S =  .517    -.24             .667    ...             ...     ...
               -3S =  .397    -.36             .618    ...             ...     ...

 1.5    .386   +3S =  .746    +.36     .621    .798    ...     .607    .852   +.245
               +2S =  .626    +.24             .728    ...             .761   +.154
                +S =  .506    +.12             .670    ...             .680   +.073
                -S =  .266    -.12             .578    ...             .542   -.065
               -2S =  .146    -.24             .541    ...             .484   -.123
               -3S =  .026    -.36             .508    ...             .433   -.174


ZONES OF ESTIMATE AND THEIR SIGNIFICANCE 


A careful study of this table should make clear the nature 
of estimates based on the three types of equations here 




presented. The fundamental differences lie not so much 
in the actual values of the estimates, as in the standard 
errors which measure the reliability of these estimates and 
indicate the limits within which the actual values are likely 
to fall. In other words, the differences lie in the assumptions 
made as to the character of the scatter about the curves. 

The measure $S_y$, which relates to the arithmetic curve,
gives the same absolute range to errors of estimate whether
the estimated value be high or low. An arithmetic
dispersion about the curve is assumed. In each case the
estimate is the arithmetic mean of the value which exceeds
the estimate by an amount equal to $S_y$ (or any multiple
of $S_y$) and the value which falls below it by an equal amount.
These conditions are brought out graphically in Fig. 81. 
The original points are plotted, the straight line of relation- 
ship (arithmetic) is shown, and zones of estimate having 
widths, respectively, of 2S, 4S and 6S, centering at the 
fitted line, are marked out. 

The measure $S_{\log y}$ gives the same relative or percentage
range to errors of estimate, whether the estimate be high
or low. This means that the absolute range within which
the actual values should fall is much less when the estimates
are low than when they are high. It assumes a geometric
dispersion about the curve which describes the relationship.
The estimate is, in this case, the geometric mean of the
value which exceeds it by an amount equal to $S_{\log y}$ (or
any multiple of $S_{\log y}$) and the value which falls below it
by an equal amount. Fig. 82 presents these relationships
graphically. The original data are here plotted, together
with the graph of the equation

$$Y = .9804X^{-1.18206}$$


There are shown, also, the limits of zones of estimate having
widths equal, respectively, to $2S_r$, $4S_r$ and $6S_r$, centering
(geometrically) at the line of relationship. A comparison
of Fig. 81 and Fig. 82 will perhaps give a better
understanding of the differences between estimates based on the
assumption of an arithmetic distribution and those based
on the assumption of a geometric distribution.


FIG. 81. The Relation between the Production and Price of Oats; Illustrating
the Use of an Arithmetic Equation of Regression and Arithmetic Zones of
Estimate (Horizontal axis: Production Ratio)




The points and lines shown in Fig. 82 are plotted on a 
logarithmic scale in Fig. 83. On this scale the curve of 
relationship becomes straight, and the zones of estimate 


FIG. 82. The Relation between the Production and Price of Oats; Illustrating
the Use of a Logarithmic Equation of Regression and Geometric Zones of
Estimate




appear as symmetrical and of equal width throughout the
range. This transformation when the data are plotted on
logarithmic paper makes clear the fundamental simplicity
of the assumptions involved in making estimates from
logarithmic values.

FIG. 83. The Relation between the Production and Price of Oats; Illustrating
the Use of a Logarithmic Equation of Regression and Geometric Zones of
Estimate (Plotted on Double Logarithmic Paper; horizontal axis: Production Ratio)

In using the measure $S_{1/y}$ we carry still further the
assumption that the variability about the curve is greater with
high prices than with low. It shows a very limited range to
errors of estimate when the estimate is low and a very
wide range when the estimated price is high. A harmonic
dispersion about the curve is assumed. The computed
value, or estimate, is always the harmonic mean of the
value which exceeds it by an amount equal to $S_{1/y}$ (or any
multiple of $S_{1/y}$) and the value which falls below it by an
equal amount.


In Fig. 84 the curve

$$\frac{1}{Y} = -.1357 + 1.1643X$$

is plotted, together with the original observations. Zones of estimate
with widths of $2S_{1/y}$, $4S_{1/y}$ and $6S_{1/y}$, centering (harmonically)
at the fitted line, are shown. The differences between this
figure and each of the two preceding are quite marked,
particularly with respect to the zones of estimate. On the
assumption of a normal harmonic distribution about the
curve describing the relationship, the outer zone (with width
equal to $6S_{1/y}$) marks the limits within which 99.7 per cent
of all the points should fall, and the inner zone (with width
equal to $2S_{1/y}$) marks the limits within which 68 per cent of
all the points should fall. By plotting reciprocals throughout,
instead of natural numbers, this apparently abnormal
distribution could be reduced to the symmetrical form
secured in plotting the geometric values on the logarithmic
chart.
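The three kinds of zone discussed above can be compared numerically. The sketch below is illustrative only: the function names and the sample values of the estimate and of the three standard errors are assumptions, not figures from the oats study. It assumes that the geometric measure is expressed in common logarithms and the harmonic measure in reciprocals, and it checks that the estimate is the arithmetic, geometric or harmonic mean of the two limits of its zone.

```python
import math


def arithmetic_zone(estimate, s, k=1):
    """Limits at estimate -/+ k*s (a zone of width 2*k*s)."""
    return estimate - k * s, estimate + k * s


def geometric_zone(estimate, s_log, k=1):
    """Limits at log10(estimate) -/+ k*s_log, converted back to natural numbers."""
    factor = 10 ** (k * s_log)
    return estimate / factor, estimate * factor


def harmonic_zone(estimate, s_recip, k=1):
    """Limits at 1/estimate +/- k*s_recip, converted back to natural numbers.

    Requires 1/estimate > k*s_recip so that the upper limit stays finite.
    """
    lower = 1 / (1 / estimate + k * s_recip)
    upper = 1 / (1 / estimate - k * s_recip)
    return lower, upper


# Hypothetical values, chosen only for illustration.
estimate = 1.0
a_lo, a_hi = arithmetic_zone(estimate, s=0.2)
g_lo, g_hi = geometric_zone(estimate, s_log=0.08)
h_lo, h_hi = harmonic_zone(estimate, s_recip=0.2)

arithmetic_mean = (a_lo + a_hi) / 2
geometric_mean = math.sqrt(g_lo * g_hi)
harmonic_mean = 2 / (1 / h_lo + 1 / h_hi)

# Each mean reproduces the estimate at the centre of its own zone.
print(arithmetic_mean, geometric_mean, harmonic_mean)
```

In natural numbers the geometric and harmonic zones extend further above the estimate than below it, which is the asymmetry seen in Fig. 82 and Fig. 84.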

For both high and low estimates the geometric measure,
$S_{\log y}$, stands between the arithmetic measure, $S_y$, and the
harmonic measure, $S_{1/y}$. While the two latter have their
particular functions, and are appropriate in certain cases,
it is probably true that in using such methods as these in
economic analysis, measures of the geometric family are
more generally useful than those of the other types. This
means, merely, that ratios are usually more important
than absolute differences. It seems reasonable therefore
to base estimates upon an equation of the type

$$\log Y = f(X)$$

and to measure the reliability of these estimates in terms
of ratios, using $S_{\log y}$. In such cases, as we have seen,
correlation is measured by $\rho_{\log y}$, the value of which depends
upon the ratio variability about the curve as compared
with the ratio variability about the geometric mean.¹


Fig. 84. The Relation between the Production and Price of Oats; Illustrating
the Use of an Equation of Regression Based upon Reciprocals and of Harmonic
Zones of Estimate (axes: Production Ratio, Price Ratio)


REFERENCES 


Killough, H. B. A Statistical Analysis of Oat Prices. A
manuscript prepared for the U.S. Bureau of Agricultural
Economics.

Moore, H. L. Elasticity of Demand and Flexibility of Prices.
Journal of the American Statistical Association, March, 1922.
Empirical Laws of Demand and Supply and the Flexibility
of Prices. Political Science Quarterly, Dec., 1919.

Walsh, C. M. The Problem of Estimation (1-67).


Examples of curves employed to describe relationships of the
type discussed in this chapter will be found in


Schultz, Henry. The Statistical Measurement of the Elasticity of
Demand for Beef. Journal of Farm Economics, July, 1924
(254-278).

Working, Holbrook. Factors Determining the Price of Potatoes
in St. Paul and Minneapolis. Technical Bulletin 10, University
of Minnesota Agricultural Experiment Station.


1 The reasoning in C. M. Walsh's book, The Problem of Estimation (London,
King, 1921), is peculiarly applicable to the present problem. Citing Galileo,
in defence of the use of the geometric mean in averaging estimates, Walsh
writes: “And so errors must be measured by an error which is a ratio between
the estimate and the true quantity, and not a concrete quantity itself. We cannot
measure errors by so many pounds, feet or crowns; we must measure them
by the proportions of the pounds, feet or crowns in the erroneous estimates to
the pounds, feet or crowns in the thing estimated.” (Italics mine.) (p. 12.) This
argument bears out powerfully what has been said as to the use of logarithmic
functions in estimating, and as to the employment of logarithmic measures of
errors of estimate.


CHAPTER XIV 


THE MEASUREMENT OF RELATIONSHIP: MULTIPLE 
AND PARTIAL CORRELATION 


In dealing with the problem of correlation in the pre- 
ceding chapters we have been concerned with problems 
involving only two variables, a dependent variable and a 
single independent variable. We have found, in certain 
cases, a fairly high degree of correlation between the two 
variables studied. But it is obvious that, in general, 
economic phenomena are affected by more than one factor, 
that the fluctuations in a single variable may be due to 
the interaction of many forces. In dealing with just two 
variables all other factors are ignored, on the assumption, 
usually, that in the single independent variable are found 
the most important causes¹ of fluctuations in the dependent
variable. Thus, in the alfalfa example given, the effect
upon yield of but a single factor, irrigation, was studied. 
Yet variations in rainfall and temperature must have 
affected the yield in the different years studied. Similarly, 
variations in practically every factor dealt with in economic 
analysis are traceable to more than one cause. If our 
analysis is to be complete we must employ methods which 
will enable more than two variables to be handled at a 
time. We need instruments which will assist us in measuring 
the combined effect upon a single variable of a number 
of factors. Such instruments may be secured by a simple 
extension of methods already familiar. 

In the following table are presented figures showing the
yield of corn, per acre, in Kansas from 1890 to 1922,
together with the average June, July, and August temperatures
for each of these years. A straight line of trend has
been fitted to the yield figures, and the actual yield is
given, in column (4), in percentages of normal, as measured
by the line of trend.

1 This should not be taken to mean that the coefficient of correlation measures
or establishes causal relationships. Cf. Chapter XVI on this point.


TABLE 119
Corn Yield and Temperature in Kansas,¹ 1890-1922

(1)     (2)         (3)           (4)             (5)           (6)           (7)
Year    Actual      Trend value,  Actual yield    Average       Average       Average
        yield per   yield per     as percentage   June          July          August
        acre, in    acre, in      of trend        temperature   temperature   temperature
        bushels     bushels²      value
                                  $X_1$           $X_2$         $X_3$         $X_4$

1890    15.6        22.4           69.6           77.6          83.1          [illegible]
1891    26.7        22.2          120.3           70.7          74.0          [illegible]
1892    24.5        22.1          110.9           73.4          [illegible]   76.5
1893    21.3        21.9           97.3           74.7          [illegible]   73.8
1894    11.2        21.8           51.4           74.2          [illegible]   78.0
1895    24.3        21.6          112.5           [illegible]   74.9          76.0
1896    28.0        21.5          130.2           74.1          78.1          [illegible]
1897    18.0        21.3           84.5           76.6          80.2          76.0
1898    16.0        21.2           75.5           75.0          [illegible]   78.2
1899    27.0        21.0          128.6           73.9          76.2          80.6
1900    19.0        20.9           90.9           74.9          77.9?         81.0
1901     7.8        20.7           37.7           77.3          85.0          79.1
1902    29.9        20.6          145.1           70.9          76.8          78.2
1903    25.6        20.4          125.5           67.2          78.3          75.8
1904    20.9        20.3          103.0           70.4          75.6          74.6
1905    27.7        20.1          137.8           [illegible]   74.5          78.7
1906    28.9        20.0          144.5           71.8          73.8          76.3
1907    22.1        19.8          111.6           72.0          78.4          78.1
1908    22.0        19.7          111.7           [illegible]   75.8          76.2?
1909    19.9        19.5          102.1           [illegible]   78.1          80.1
1910    19.0        19.4           97.9           72.2?         79.5          [illegible]
1911    14.5        19.2           75.5           80.5          78.6          76.4
1912    23.0        19.1          120.4           69.3          79.9          77.4?
1913     3.2        18.9           16.9           74.2          82.1          84.2
1914    18.5        18.8           98.4           [illegible]   79.9          78.2
1915    31.0        18.6          166.7           69.2          74.0          70.1
1916    10.0        18.5           54.1           70.3          81.2          79.6
1917    13.0        18.3           71.0           72.8          80.8          73.4
1918     7.1        18.2           39.0           78.4          78.3          82.3?
1919    15.2        18.0           84.4           72.3          80.2          78.3
1920    26.5        17.9          148.0           72.8          77.6          72.9
1921    22.2        17.7          125.4           [illegible]   79.2          78.6
1922    19.3        17.6          109.7           75.2          77.0          80.1

1 The data of corn yield are from Bulletin 515, U.S.D.A., and from the
Yearbooks of the U.S.D.A. Temperature data are from reports of the U.S.
Weather Bureau.

2 The equation to the line of trend is $Y = 19.967 - .1502X$, with origin at 1906.
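The trend values and the percentages of normal can be reproduced from the stated line of trend. The following is a minimal sketch; the function names are illustrative, and rounding the trend value to one decimal place before taking the percentage is an assumption inferred from the figures printed in Table 119.

```python
def trend_value(year, intercept=19.967, slope=-0.1502, origin=1906):
    """Normal (trend) yield in bushels per acre, from Y = 19.967 - .1502X."""
    return intercept + slope * (year - origin)


def percent_of_trend(actual_yield, year):
    """Actual yield as a percentage of the trend value.

    Assumption: the trend value is rounded to one decimal first,
    as the tabulated percentages suggest.
    """
    normal = round(trend_value(year), 1)
    return 100 * actual_yield / normal


# Two rows of Table 119: (year, actual yield per acre in bushels).
for year, actual in [(1890, 15.6), (1922, 19.3)]:
    print(year, round(trend_value(year), 1), round(percent_of_trend(actual, year), 1))
```

For 1890 this gives a trend value of 22.4 bushels and a percentage of 69.6, in agreement with the first row of the table.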


THE RELATION BETWEEN CORN YIELD AND TEMPERATURE; 
PRELIMINARY ANALYSIS 


It is known that corn yield is affected by the temperature 
during the growing season. The object of the present 
study is the determination of the precise relation between 
yield and temperature during each of the three months 
given, in order to secure a basis for estimating the yield 
from a knowledge of the temperature. As certain growing 
months are more important than others, the relation 
between temperature and yield may be determined, first, 
for each of the three months separately. 

The equation which describes the relationship between
yield per acre and June temperature will be of the type

$$X_1 = a + b_{12}X_2$$

The equation describing the relationship between yield
per acre and July temperature will be of the type

$$X_1 = a + b_{13}X_3$$

(In each case $X_1$ represents the percentage relation of
actual to trend value, while $X_2$, $X_3$, etc., represent the
actual temperature, in degrees Fahrenheit.) Instead of
using the symbols Y and X to represent the variables,
as in the preceding examples, $X_1$, $X_2$, $X_3$, etc., are employed,
$X_1$ representing in this case the dependent variable.
The symbol for the constant representing the slope (the
coefficient of regression) is, in the first instance above, $b_{12}$.
The subscripts 1 and 2 indicate the variables to which this
constant refers, the first subscript always representing the
dependent variable ($X_1$ in the first example above), the
second the independent variable ($X_2$ in the present case).
These subscripts are necessary to distinguish the different
constants when several variables enter into the problem.
The meaning is precisely the same as in the former examples
when no subscripts were needed because only two variables
were dealt with.

Solving the proper normal equations for the constants
in the equation which describes the average relationship
between yield per acre (percentage relation of actual to
trend) and June temperature, we have

$$X_1 = 522.31 - 5.743X_2$$

The value of $S_{12}$ may be determined from the formula

$$S_{12}^2 = \frac{\Sigma(X_1^2) - a\Sigma(X_1) - b_{12}\Sigma(X_1X_2)}{N}$$

(The subscripts to S, and those to r which appear below,
have the same meaning as those employed in the preceding
paragraph.) Substituting the given values, and solving,
we have

$$S_{12}^2 = 913.178$$

and

$$S_{12} = 30.22$$


The significance of the standard error, S, as a measure of
the reliability of estimates based upon the equation of
relationship, has been fully explained. In judging the
usefulness of the equation, $S_{12}$ should be compared with $\sigma_1$
(the standard deviation of $X_1$) which may be looked upon
as a measure of the reliability of estimates based upon the
arithmetic mean of the variable $X_1$. For this we have

$$\sigma_1 = 34.477$$

Clearly, the estimates from the equation are more reliable
than those based upon the mean. The coefficient of
correlation, r, expresses this relationship in abstract terms.
We may get this value from the equation

$$r_{12}^2 = \frac{a\Sigma(X_1) + b_{12}\Sigma(X_1X_2) - Nc_1^2}{\Sigma(X_1^2) - Nc_1^2}$$

Solving for r, and giving it the sign of $b_{12}$, we have

$$r_{12} = -.4814$$


These values indicate a negative correlation, though not 
a high one, between yield per acre of corn and June tem- 
perature in Kansas. Let us see if the estimates could be 
improved if based upon the temperature in July instead 
of in June. 
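The June results can be checked from the sums for $X_1$ and $X_2$ that are tabulated later in the chapter. The sketch below is a minimal check, not the author's worksheet: it takes N = 33 (the years 1890 to 1922), uses unrounded means, and obtains the standard error from the algebraically equivalent form $S_{12}^2 = \sigma_1^2(1 - r_{12}^2)$ rather than from the formula in sums given above. The variable names are illustrative.

```python
import math

N = 33  # years 1890-1922 inclusive

# Sums taken from the chapter's tabulation for X1 (yield ratio) and X2 (June).
sum_x1, sum_x2 = 3298.1, 2426.9
sum_x1_sq, sum_x2_sq = 368846.67, 178755.75
sum_x1x2 = 240967.22

mean1, mean2 = sum_x1 / N, sum_x2 / N
var1 = sum_x1_sq / N - mean1 ** 2      # sigma_1 squared
var2 = sum_x2_sq / N - mean2 ** 2      # sigma_2 squared
p12 = sum_x1x2 / N - mean1 * mean2     # mean product of deviations

b12 = p12 / var2                       # coefficient of regression
a = mean1 - b12 * mean2                # constant term
r12 = p12 / math.sqrt(var1 * var2)     # coefficient of correlation
sigma1 = math.sqrt(var1)
s12 = math.sqrt(var1 * (1 - r12 ** 2))  # standard error of estimate

print(round(a, 2), round(b12, 3), round(s12, 2), round(sigma1, 3), round(r12, 4))
```

The results agree with the values in the text to within rounding.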

The values needed in this study may be computed from
Table 119. Solving for the constants in the equation of
regression, we secure the equation

$$X_1 = 827.64 - 9.302X_3$$

For the standard error, we have

$$S_{13} = 24.73$$

and for the coefficient of correlation

$$r_{13} = -.6968$$

We have here a closer relation and a better basis for
estimates than in the case when June temperature was
considered.

Repeating the process for yield per acre and August
temperature, we have

$$X_1 = 517.86 - 6.098X_4$$

$$S_{14} = 29.98$$

$$r_{14} = -.4937$$

August temperature, it is evident, also affects the corn 
yield in Kansas, a low temperature conducing to yield 
above normal. The relationship is not so close as in the 
case of July temperature, but it is still significant. What 
is needed now is some method of combining these three 
factors, in order that an estimate may be based upon a 
knowledge of the influence of these factors, in combina- 
tion, upon the yield of corn. The addition or averaging of 
the temperatures in the three months will not do, for 
July is obviously more important than either of the other 
months. The principle of the method by which this may 
be accomplished is simple. 


THE ESTIMATION OF CORN YIELD FROM THREE
INDEPENDENT VARIABLES


The estimating or regression equation in the present case
will be one in which there is a single dependent variable
(corn yield) and three independent variables. It will be
of the form

$$X_1 = a + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4$$

If we can determine the values of the four constants, we
may substitute given values of $X_2$, $X_3$ and $X_4$ in the equation
and thus get an estimate for $X_1$ in precisely the same
way as when two variables are dealt with. The method
of least squares affords the means of solving for the required
constants.

The symbols require a word of explanation, as a perfectly
simple equation is given a rather ponderous appearance
by all the subscripts employed. The symbol $b_{12}$,
it has been explained, represents the coefficient of regression
of $X_1$ on $X_2$ (i.e. the slope of the line describing their
relationship, $X_1$ being dependent) when these two variables
alone are included in the study. The symbol $b_{12.34}$ represents
the coefficient of net regression of $X_1$ on $X_2$. The
addition of the subscripts 3 and 4 to the right of the period
means, simply, that the variables $X_3$ and $X_4$ have been
included in the study and their effect eliminated, in so far
as this one constant ($b_{12.34}$) is concerned. This constant
measures the weight which must be given to the variable
$X_2$ in an estimate of $X_1$ based upon the three independent
variables, $X_2$, $X_3$ and $X_4$. It will not, of course, be the same
as $b_{12}$, which indicates the weight given to $X_2$ when an
estimate of $X_1$ is based upon $X_2$ alone. Similarly the
constant $b_{13.24}$, the coefficient of net regression of $X_1$ on $X_3$,
measures the weight given to $X_3$ when $X_2$ and $X_4$ are also
included. Each coefficient represents a single, simple constant,
but the subscripts are necessary in order that the
precise meaning of this constant may be clear. The subscripts
to the left of the period are termed primary subscripts,
those to the right secondary subscripts.


FORMATION AND SOLUTION OF THE NORMAL
EQUATIONS


The first task¹ is the securing of the normal equations
required in solving for the constants in the estimating
equation given above. Following the usual procedure²
we have:

$$
\begin{aligned}
\text{I}\quad & \Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4) \\
\text{II}\quad & \Sigma(X_1X_2) = a\Sigma(X_2) + b_{12.34}\Sigma(X_2^2) + b_{13.24}\Sigma(X_2X_3) + b_{14.23}\Sigma(X_2X_4) \\
\text{III}\quad & \Sigma(X_1X_3) = a\Sigma(X_3) + b_{12.34}\Sigma(X_2X_3) + b_{13.24}\Sigma(X_3^2) + b_{14.23}\Sigma(X_3X_4) \\
\text{IV}\quad & \Sigma(X_1X_4) = a\Sigma(X_4) + b_{12.34}\Sigma(X_2X_4) + b_{13.24}\Sigma(X_3X_4) + b_{14.23}\Sigma(X_4^2)
\end{aligned}
$$

The given values might be substituted in these simultaneous
equations and solutions secured directly for the
four constants. It is possible to reduce the number of
normal equations by one, however, and thus lessen materially
the labor of computation. This is done by using
deviations from the arithmetic mean for each variable
instead of absolute values, getting rid in this way of the
constant term a in the original equation.

If we let $A_1$, $A_2$, $A_3$, etc., represent the arithmetic means
of the different variables while $x_1$, $x_2$, $x_3$, etc., represent
deviations from the means, we may replace the absolute
numbers $X_1$, $X_2$, $X_3$, etc., by their equivalents, $x_1 + A_1$,
$x_2 + A_2$, $x_3 + A_3$, etc. Making these substitutions in the
normal equations, certain algebraic simplifications are
possible which eliminate the first of the normal equations,
and reduce the others to the following form:

$$
\begin{aligned}
\frac{\Sigma(x_1x_2)}{N} &= \frac{\Sigma(x_2^2)}{N}b_{12.34} + \frac{\Sigma(x_2x_3)}{N}b_{13.24} + \frac{\Sigma(x_2x_4)}{N}b_{14.23} \\
\frac{\Sigma(x_1x_3)}{N} &= \frac{\Sigma(x_2x_3)}{N}b_{12.34} + \frac{\Sigma(x_3^2)}{N}b_{13.24} + \frac{\Sigma(x_3x_4)}{N}b_{14.23} \\
\frac{\Sigma(x_1x_4)}{N} &= \frac{\Sigma(x_2x_4)}{N}b_{12.34} + \frac{\Sigma(x_3x_4)}{N}b_{13.24} + \frac{\Sigma(x_4^2)}{N}b_{14.23}
\end{aligned}
$$

All the variables in the above equations refer to deviations
from the respective arithmetic means. Therefore $\Sigma(x_1x_2)/N$
is simply the mean product of the variables $x_1$ and $x_2$,
$\Sigma(x_2^2)/N$ is $\sigma_2^2$, etc. Representing the various mean products
by the symbols $p_{12}$, $p_{13}$, etc., and inserting the symbols
for the squares of the standard deviations, we secure, for
the normal equations:

$$
\begin{aligned}
p_{12} &= \sigma_2^2 b_{12.34} + p_{23}b_{13.24} + p_{24}b_{14.23} \\
p_{13} &= p_{23}b_{12.34} + \sigma_3^2 b_{13.24} + p_{34}b_{14.23} \\
p_{14} &= p_{24}b_{12.34} + p_{34}b_{13.24} + \sigma_4^2 b_{14.23}
\end{aligned}
$$

This is the most convenient form for the solution of the
normal equations.

1 The approach to the problem of multiple correlation which is here taken
follows that of H. R. Tolley and M. J. B. Ezekiel (“A Method of Handling Multiple
Correlation Problems,” Journal of the American Statistical Association, December,
1923, 993-1003). Statisticians are indebted to Messrs. Tolley and Ezekiel for the
simple and effective method they have evolved. For details not here given, relating
to the simplification of the normal equations, readers are referred to this article.

2 Cf. Appendix A for a discussion of this procedure.
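The simplification described above can be sketched for normal equation II. This is an illustrative derivation in the notation of the text, not a reproduction of the steps in the Tolley and Ezekiel article; it uses only the facts that $X_i = x_i + A_i$ and $\Sigma(x_i) = 0$.

```latex
\begin{aligned}
\Sigma(X_1X_2) &= \Sigma(x_1x_2) + NA_1A_2, \qquad
\Sigma(X_2X_j) = \Sigma(x_2x_j) + NA_2A_j, \qquad
\Sigma(X_j) = NA_j \\[4pt]
\text{II:}\quad \Sigma(x_1x_2) + NA_1A_2 &= aNA_2
  + b_{12.34}\,[\Sigma(x_2^2) + NA_2^2]
  + b_{13.24}\,[\Sigma(x_2x_3) + NA_2A_3]
  + b_{14.23}\,[\Sigma(x_2x_4) + NA_2A_4] \\[4pt]
\text{I} \times A_2:\quad NA_1A_2 &= aNA_2
  + b_{12.34}NA_2^2 + b_{13.24}NA_2A_3 + b_{14.23}NA_2A_4 \\[4pt]
\text{II} - \text{I} \times A_2:\quad \Sigma(x_1x_2) &=
  b_{12.34}\Sigma(x_2^2) + b_{13.24}\Sigma(x_2x_3) + b_{14.23}\Sigma(x_2x_4)
\end{aligned}
```

Dividing the last line by N gives the first of the reduced equations; equations III and IV reduce in the same way, with $A_3$ and $A_4$ in place of $A_2$.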

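From the data, as arranged in Table 119, the following
values are derived:

$$
\begin{aligned}
\Sigma(X_1) &= 3298.1 & \Sigma(X_1^2) &= 368846.67 \\
\Sigma(X_2) &= 2426.9 & \Sigma(X_2^2) &= 178755.75 \\
\Sigma(X_3) &= 2581.5 & \Sigma(X_3^2) &= 202163.79 \\
\Sigma(X_4) &= 2553.8 & \Sigma(X_4^2) &= 197890.32
\end{aligned}
$$

$$
\begin{aligned}
\Sigma(X_1X_2) &= 240967.22 \\
\Sigma(X_1X_3) &= 255954.11 \\
\Sigma(X_1X_4) &= 253664.85 \\
\Sigma(X_2X_3) &= 189941.83 \\
\Sigma(X_2X_4) &= 187909.38 \\
\Sigma(X_3X_4) &= 199845.00
\end{aligned}
$$

$$c_1 = \frac{\Sigma(X_1)}{N}$$

$$
\begin{aligned}
c_1 &= 99.9424 & c_1^2 &= 9988.4833 \\
c_2 &= 73.5424 & c_2^2 &= 5408.4846 \\
c_3 &= 78.2273 & c_3^2 &= 6119.5105 \\
c_4 &= 77.3879 & c_4^2 &= 5988.8871
\end{aligned}
$$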


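From these values, the quantities necessary for the
solution of the normal equations may be readily determined.
These quantities are brought together below:

$$
\begin{aligned}
\sigma_1^2 &= \frac{368846.67}{33} - 9988.4833 = 1188.688 \\
\sigma_2^2 &= \frac{178755.75}{33} - 5408.4846 = 8.3564 \\
\sigma_3^2 &= \frac{202163.79}{33} - 6119.5105 = 6.6645 \\
\sigma_4^2 &= \frac{197890.32}{33} - 5988.8871 = 7.7893 \\
p_{12} &= \frac{240967.22}{33} - (99.9424 \times 73.5424) = -47.967 \\
p_{13} &= \frac{255954.11}{33} - (99.9424 \times 78.2273) = -62.039 \\
p_{14} &= \frac{253664.85}{33} - (99.9424 \times 77.3879) = -47.519 \\
p_{23} &= \frac{189941.83}{33} - (73.5424 \times 78.2273) = 2.790 \\
p_{24} &= \frac{187909.38}{33} - (73.5424 \times 77.3879) = 2.932 \\
p_{34} &= \frac{199845.00}{33} - (78.2273 \times 77.3879) = 2.063
\end{aligned}
$$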
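The variances and mean products above follow mechanically from the tabulated sums. The sketch below recomputes them with unrounded means, taking N = 33; the dictionary names are illustrative. Because the text works with means rounded to four decimal places, the results differ from the printed figures by a few units in the third decimal place.

```python
N = 33  # years 1890-1922 inclusive

# Sums from the chapter, keyed by variable number (1 = yield ratio, 2-4 = temperatures).
sums = {1: 3298.1, 2: 2426.9, 3: 2581.5, 4: 2553.8}
sums_sq = {1: 368846.67, 2: 178755.75, 3: 202163.79, 4: 197890.32}
sums_prod = {
    (1, 2): 240967.22, (1, 3): 255954.11, (1, 4): 253664.85,
    (2, 3): 189941.83, (2, 4): 187909.38, (3, 4): 199845.00,
}

means = {i: total / N for i, total in sums.items()}

# sigma_i squared = mean of squares minus square of mean.
variances = {i: sums_sq[i] / N - means[i] ** 2 for i in sums}

# p_ij = mean of products minus product of means.
mean_products = {
    (i, j): total / N - means[i] * means[j] for (i, j), total in sums_prod.items()
}

for i in sorted(variances):
    print(f"sigma_{i}^2 = {variances[i]:.4f}")
for pair in sorted(mean_products):
    print(f"p_{pair[0]}{pair[1]} = {mean_products[pair]:.3f}")
```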




Substituting in the normal equations, we have:

$$
\begin{aligned}
-47.967 &= 8.3564b_{12.34} + 2.790b_{13.24} + 2.932b_{14.23} \\
-62.039 &= 2.790b_{12.34} + 6.6645b_{13.24} + 2.063b_{14.23} \\
-47.519 &= 2.932b_{12.34} + 2.063b_{13.24} + 7.7893b_{14.23}
\end{aligned}
$$

Solving these simultaneous equations¹ we secure the
following values for the constants:

$$b_{12.34} = -2.095 \qquad b_{13.24} = -7.394 \qquad b_{14.23} = -3.354$$

The required equation is, therefore,

$$x_1 = -2.095x_2 - 7.394x_3 - 3.354x_4$$
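The three simultaneous equations can be solved by any standard method. The sketch below uses plain Gaussian elimination with partial pivoting as a stand-in; it is not the Doolittle method that the text recommends, and the function name is illustrative. The coefficients are the variances and mean products printed above.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left unchanged.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate the column from the rows below the pivot.
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back-substitution.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


# Rows: the three normal equations in deviation form.
coefficients = [
    [8.3564, 2.790, 2.932],
    [2.790, 6.6645, 2.063],
    [2.932, 2.063, 7.7893],
]
mean_products = [-47.967, -62.039, -47.519]

b12_34, b13_24, b14_23 = solve_linear_system(coefficients, mean_products)
print(round(b12_34, 3), round(b13_24, 3), round(b14_23, 3))
```

The solution agrees with the coefficients of net regression given in the text.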


This is the equation of net regression of $x_1$ on $x_2$, $x_3$
and $x_4$. Any given values of the three independent variables
(June temperature, July temperature and August
temperature) may be substituted in this equation, and the 
most probable value of the dependent variable (corn yield 
per acre) determined. In the equation as it stands, it 
should be noted, all the variables are expressed as devia- 
tions from their respective arithmetic means. For practical 
purposes it is advisable to have an equation in terms of 
the original values. In other words, it is desirable to shift 
the origin from the point of averages to the zero point on 
the original scales. This necessitates re-introducing the 
constant term a. 

The value of a may be determined from the equation

$$A_1 = a + A_2b_{12.34} + A_3b_{13.24} + A_4b_{14.23}$$

where the large A's represent the respective arithmetic
means.² Inserting the proper values, we have¹

$$99.9424 = a + (73.5424 \times -2.094816) + (78.2273 \times -7.39374) + (77.3879 \times -3.353796)$$

Solving,

$$a = 1091.94$$

The equation of regression in terms of original values is,
therefore,

$$X_1 = 1091.94 - 2.095X_2 - 7.394X_3 - 3.354X_4$$

1 Any method of solution may be employed. Perhaps the most convenient
with three or more equations is the Doolittle method. This is explained in detail
in Appendix A.

2 This equation is derived from the first normal equation (equation I above),

$$\Sigma(X_1) = Na + b_{12.34}\Sigma(X_2) + b_{13.24}\Sigma(X_3) + b_{14.23}\Sigma(X_4)$$

Replacing the absolute numbers $X_1$, $X_2$, etc., by their equivalents, $x_1 + A_1$, $x_2 + A_2$,
etc., we secure

$$\Sigma(x_1) + NA_1 = Na + b_{12.34}[\Sigma(x_2) + NA_2] + b_{13.24}[\Sigma(x_3) + NA_3] + b_{14.23}[\Sigma(x_4) + NA_4]$$

Since $\Sigma(x_1) = 0$, $\Sigma(x_2) = 0$, etc., these terms disappear. Dividing through by
N we obtain the equation presented above.
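The shift of origin can be sketched in a few lines. The values are the means and the unrounded coefficients quoted in the text; the names `net_b` and `estimate_x1` are illustrative.

```python
# Arithmetic means of X1..X4 and the coefficients of net regression
# (given to extra decimal places, as in the text's computation of a).
means = {1: 99.9424, 2: 73.5424, 3: 78.2273, 4: 77.3879}
net_b = {2: -2.094816, 3: -7.39374, 4: -3.353796}

# A1 = a + A2*b12.34 + A3*b13.24 + A4*b14.23, solved for a.
a = means[1] - sum(net_b[j] * means[j] for j in net_b)


def estimate_x1(x2, x3, x4):
    """Estimated yield ratio from temperatures on their original scales."""
    return a + net_b[2] * x2 + net_b[3] * x3 + net_b[4] * x4


print(round(a, 2))
```

A useful check on the constant is that the equation returns the mean of $X_1$ when each temperature is set at its own mean.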


COMPUTATION OF THE STANDARD ERROR OF ESTIMATE


Are estimates based upon this equation any more
reliable than those based upon the equations previously
derived, each of which referred to a single independent
variable? To answer this question the value of the standard
error must be computed. This will be represented in the
present case by $S_{1.234}$, the subscripts referring to the single
dependent variable ($X_1$) and the three independent variables.
This value may be computed from the formula²

$$S_{1.234}^2 = \sigma_1^2 - b_{12.34}p_{12} - b_{13.24}p_{13} - b_{14.23}p_{14}$$


1 The arbitrary origin is at zero on each of the original scales, hence $A_1 = c_1$,
$A_2 = c_2$, etc. To ensure greater accuracy in solving for a, the values of the
coefficients $b_{12.34}$, $b_{13.24}$, etc., are given to a greater number of decimal places than in
the equation of regression.

2 This formula may be derived as follows: Given an equation of the type

$$x_1 = b_{12.34}x_2 + b_{13.24}x_3 + b_{14.23}x_4$$

(in which the variables refer to deviations from the means) each residual may be
computed from the equation

$$d = b_{12.34}x_2 + b_{13.24}x_3 + b_{14.23}x_4 - x_1 \qquad (1)$$

Multiplying throughout by d, and adding, we have

$$\Sigma(d^2) = b_{12.34}\Sigma(dx_2) + b_{13.24}\Sigma(dx_3) + b_{14.23}\Sigma(dx_4) - \Sigma(dx_1)$$

but it follows from the method of fitting that

$$\Sigma(dx_2) = 0 \qquad \Sigma(dx_3) = 0 \qquad \Sigma(dx_4) = 0$$

Therefore

$$\Sigma(d^2) = -\Sigma(dx_1) \qquad (2)$$

Multiplying each residual equation (1) by $x_1$ and adding, we have

$$\Sigma(dx_1) = b_{12.34}\Sigma(x_1x_2) + b_{13.24}\Sigma(x_1x_3) + b_{14.23}\Sigma(x_1x_4) - \Sigma(x_1^2)$$

Substituting the equivalent of $\Sigma(dx_1)$ in equation (2) we secure

$$\Sigma(d^2) = \Sigma(x_1^2) - b_{12.34}\Sigma(x_1x_2) - b_{13.24}\Sigma(x_1x_3) - b_{14.23}\Sigma(x_1x_4)$$

$$\frac{\Sigma(d^2)}{N} = \frac{\Sigma(x_1^2)}{N} - b_{12.34}\frac{\Sigma(x_1x_2)}{N} - b_{13.24}\frac{\Sigma(x_1x_3)}{N} - b_{14.23}\frac{\Sigma(x_1x_4)}{N}$$

Since the variables refer to deviations from the means, we have

$$S_{1.234}^2 = \sigma_1^2 - b_{12.34}p_{12} - b_{13.24}p_{13} - b_{14.23}p_{14}$$

The standard error may also be derived from the equation in the general form,
with the origin at zero on the original scales. For this general case we have

$$S_{1.234\ldots n}^2 = \frac{\Sigma(X_1^2) - a\Sigma(X_1) - b_{12.34\ldots n}\Sigma(X_1X_2) - b_{13.24\ldots n}\Sigma(X_1X_3) - \ldots - b_{1n.23\ldots(n-1)}\Sigma(X_1X_n)}{N}$$

A proof of these relations is set forth in Appendix A.


Substituting the proper values, we have

$$S_{1.234}^2 = 1188.688 - 100.482 - 458.700 - 159.369 = 470.137$$

$$S_{1.234} = 21.68$$

This is to be interpreted just as the standard error was
interpreted in previous cases. The reliability of estimates
based upon the mean value of $X_1$ is measured by $\sigma_1$, which
has a value of 34.477. The reliability of estimates based
upon the equation of net regression, when yield is considered
as a function of temperature in June, July and
August, is measured by $S_{1.234}$, which has a value of 21.68.
It is clear that estimates made from the equation are
distinctly more reliable than those based upon a knowledge
of $X_1$ alone. We have by no means accounted for all the
factors that are responsible for variability in corn yield,
but we have measured and reduced to precise terms the
effect of three factors upon the yield of corn per acre
in Kansas.


THE COEFFICIENT OF MULTIPLE CORRELATION


We have need now of our third measure, the abstract
coefficient of correlation. The value of this coefficient, as
we have seen, depends upon the relation between S and $\sigma$.


It may be computed in the present instance from the
formula

$$R_{1.234}^2 = 1 - \frac{S_{1.234}^2}{\sigma_1^2}$$

When the relationship between a single dependent variable
and several independent variables is being studied, this
measure is termed the coefficient of multiple correlation
and is represented by the symbol R. The subscript to
the left of the period relates to the dependent variable,
while those to the right relate to the independent variables.
Substituting in this formula the equivalent of $S_{1.234}^2$, we have

$$R_{1.234}^2 = 1 - \frac{\sigma_1^2 - b_{12.34}p_{12} - b_{13.24}p_{13} - b_{14.23}p_{14}}{\sigma_1^2}$$

which reduces to¹

$$R_{1.234}^2 = \frac{b_{12.34}p_{12} + b_{13.24}p_{13} + b_{14.23}p_{14}}{\sigma_1^2}$$

Inserting the proper values we have

$$R_{1.234}^2 = \frac{100.482 + 458.700 + 159.369}{1188.688}$$

$$R_{1.234} = .778$$
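The standard error and the coefficient of multiple correlation both rest on the same three products of the form $bp$. The following minimal sketch, with illustrative variable names, recomputes them from the values quoted in the text.

```python
import math

sigma1_sq = 1188.688                           # variance of X1
net_b = [-2.094816, -7.39374, -3.353796]       # b12.34, b13.24, b14.23
mean_products = [-47.967, -62.039, -47.519]    # p12, p13, p14

# Sum of b*p: the part of the variance of X1 accounted for by the equation.
explained = sum(b * p for b, p in zip(net_b, mean_products))

s_sq = sigma1_sq - explained     # S^2_1.234
s = math.sqrt(s_sq)              # standard error of estimate
r_sq = explained / sigma1_sq     # R^2_1.234
r = math.sqrt(r_sq)              # coefficient of multiple correlation

print(round(s_sq, 3), round(s, 2), round(r, 3))
```

The two forms of the formula for $R^2$ are identical, since the explained part and $S_{1.234}^2$ together make up $\sigma_1^2$.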


The coefficient of multiple correlation is an index of the 
degree of relationship between a single dependent variable 
and a number of independent variables, in combination. 
It measures the degree to which variations in the dependent 
variable are related to the combined action of the other fac- 
tors. Its significance may be clearer if all the independent 
variables are looked upon as constituting a single inde- 
pendent series. The coefficient is then seen to be a measure 
of the relationship between the dependent variable and the 
independent series, which is precisely what the coefficient
of correlation is in the simpler case of two variables. In
the multiple case the independent series has several component
elements, but this fact does not alter the essential
significance of the coefficient. No positive or negative sign
is attached to R, it should be noted. In the present instance
all of the independent variables are negatively correlated
with corn yield, and a negative sign might be attached.
It might be the case, however, that the correlation were
positive for some of the independent variables, and negative
for others. Because of this fact, R is always given without
sign. The signs of the constants in the equation of net
regression show which of the independent variables are
positively correlated and which are negatively correlated
with the dependent variable.

1 The coefficient of multiple correlation may also be derived from the general
formula, which refers to an origin at zero on the original scales. This general
formula is

$$R_{1.234\ldots n}^2 = \frac{a\Sigma(X_1) + b_{12.34\ldots n}\Sigma(X_1X_2) + b_{13.24\ldots n}\Sigma(X_1X_3) + b_{14.23\ldots n}\Sigma(X_1X_4) + \ldots - Nc_1^2}{\Sigma(X_1^2) - Nc_1^2}$$


COMPARISON OF MEASURES OF RELATIONSHIP


The degree to which our knowledge of the causes of
variation in corn yield has been improved and the reliability
of our estimates increased by taking account of
the various factors in combination may be more readily
appreciated if we bring together the various measures
secured in the course of this analysis.


TABLE 120
A Comparison of Certain Measures Pertaining to the Corn Yield in Kansas

Basis of estimate                                      Measure of reliability    Coefficient of
                                                       of estimate               correlation

Arithmetic mean of $X_1$                               $\sigma_1 = 34.477$
$X_1 = 522.31 - 5.743X_2$                              $S_{12} = 30.22$          $r_{12} = -.4814$
$X_1 = 827.64 - 9.302X_3$                              $S_{13} = 24.73$          $r_{13} = -.6968$
$X_1 = 517.86 - 6.098X_4$                              $S_{14} = 29.98$          $r_{14} = -.4937$
$X_1 = 1091.94 - 2.095X_2 - 7.394X_3 - 3.354X_4$       $S_{1.234} = 21.68$       $R_{1.234} = .778$


The value of S might be further reduced and the value 
of R correspondingly increased by bringing into the analysis 
other factors, such as rainfall during the growing months. 
The method which has been explained may be extended
to cover any number of variables, one equation being
added to the set of simultaneous equations for each additional
variable introduced.


THE METHOD OF MULTIPLE CORRELATION VALID FOR
LINEAR RELATIONSHIPS


One important condition has not been emphasized in 
the course of the preceding discussion. The validity of this 
method of multiple correlation depends upon the existence 
of a linear relationship between each pair of variables. 
Thus with four variables there were six pairings possible 
(i.e. six mean products were computed). If there had been 
a material departure from linearity in any of these six 
relationships the significance of the results would have 
been decreased. There would be no fallacy involved in 
the use of the equation under these conditions, but it 
would not furnish as good a basis for estimates as one 
which took account of the true relationship. In such a case 
the values of S and R would indicate that the estimates 
based upon the assumption of linear relationship were not 
very reliable.¹

1 An approach to problems of multiple correlation when the relationship
between the subordinate series is non-linear is explained by M. J. B. Ezekiel in
“A Method of Handling Curvilinear Correlation for any Number of Variables,”
Journal of the American Statistical Association, Vol. XIX, N.S. No. 148, 1924.


AN APPLICATION OF THE METHOD


Let us illustrate the use of the estimating equation.
In the year 1922 the average June temperature in Kansas
was 75.2° F., the average July temperature was 77° F. and
the average August temperature was 80.1° F. What was
the probable corn yield per acre? Substituting these values
for $X_2$, $X_3$ and $X_4$ in the equation,

$$X_1 = 1091.94 - 2.095X_2 - 7.394X_3 - 3.354X_4$$

we have

$$X_1 = 1091.94 - (2.095 \times 75.2) - (7.394 \times 77.0) - (3.354 \times 80.1)$$

$$X_1 = 96.4$$

This result expresses the estimated yield as a percentage
of normal. The equation to the line of trend of corn
yield is

$$Y = 19.967 - .1502X$$
(where Y represents the yield and X the time factor, with 
origin at 1906). From this equation we secure the normal 
yield for 1922, finding it to be 17.6 bu. per acre. The 
estimated yield is 96.4 per cent of this, or 17 bu. per acre. 

What are the limits within which we may expect the 
actual yield to fall, with respect to this estimate? The 
value of $S_{1.234}$ is 21.68. This means that the odds are 68
out of 100 that the actual yield ratio will be within the 
limits 96.4 + 21.68, or 118.1, and 96.4 — 21.68, or 74.7. 
These values refer to actual yield as percentage of normal. 
Knowing the normal to be 17.6, these limits may be put 
in terms of bushels. In this form, we have 20.8 bushels 
per acre and 13.1 bushels per acre as the limits within 
which the actual value should fall in 68 cases out of 100. 
The actual yield in 1922 was 19.3 bu. per acre. 
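The whole estimate for 1922 can be retraced step by step. In the sketch below the function name is illustrative, and the rounding of the percentages to one decimal place before conversion to bushels is an assumption made so that the figures match those printed in the text.

```python
def estimate_percent_of_normal(june, july, august):
    """Estimated yield as a percentage of normal, from the equation of net regression."""
    return 1091.94 - 2.095 * june - 7.394 * july - 3.354 * august


S_1_234 = 21.68        # standard error of estimate, in percentage points
NORMAL_YIELD = 17.6    # trend value for 1922, bushels per acre

estimate_pct = round(estimate_percent_of_normal(75.2, 77.0, 80.1), 1)
lower_pct = round(estimate_pct - S_1_234, 1)
upper_pct = round(estimate_pct + S_1_234, 1)


def to_bushels(percent):
    """Convert a percentage of normal into bushels per acre."""
    return percent / 100 * NORMAL_YIELD


estimate_bu = to_bushels(estimate_pct)
lower_bu, upper_bu = to_bushels(lower_pct), to_bushels(upper_pct)

print(estimate_pct, lower_pct, upper_pct)
print(round(estimate_bu, 1), round(lower_bu, 1), round(upper_bu, 1))
```

The actual 1922 yield of 19.3 bushels lies inside the limits so obtained.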

In this illustration we have used one of the years included 
in the study. The same method would be employed in 
making an estimate for some future year, but in such a 
case an additional element of uncertainty would be intro- 
duced by the projection of the line of trend. 


THE MEANING OF PARTIAL OR NET CORRELATION


In the preceding section we have sought to determine 
the degree to which corn yield in Kansas is affected by the 
temperature in June, July and August, treating the three 
independent variables in combination. Our aim has been 
to measure their combined effect upon corn yield. There
is a related problem, which in many studies may be of
major importance. This is the determination of the 
relationship between a dependent variable and a single 
independent variable when all other factors are held constant. 
Concretely, what would be the effect upon corn yield of 
variations in July temperature, if June temperature and 
August temperature, and all other factors affecting yield, 
could be held constant? This is the problem of net or 
partial correlation. 

It is obvious that if a method could be developed by 
which a single factor could be thus isolated for separate 
study, it would add immeasurably to the analytical powers 
of the economist, and of social scientists in general. It 
would give to the student in these fields that power to 
eliminate irrelevant influences and to concentrate his 
attention upon a single factor which is possessed by the 
chemist, for example. In studying the effect of one element 
upon another the chemist seeks to eliminate all other 
elements, and the effectiveness of his analysis depends in 
large part upon the degree to which it is possible thus to 
isolate the object of immediate interest. 

It is not generally possible in economic analysis to 
eliminate all but one of the factors responsible for varia- 
tions in a given series. The direct and indirect causes of 
a given economic phenomenon are too numerous and too 
complicated in their interaction for the economist ever to 
hope to emulate the chemist in reducing his problem to 
terms of but two variables. But, within certain limits, 
the statistician is able to employ the method of the physical 
scientist in holding constant certain factors while the 
effects of variations in another are studied. The methods 
which make this possible are among the most powerful 
of the instruments which the student of the social sciences 
possesses. 

The method of partial correlation may be explained with 
reference to the problem of corn yield in Kansas. Our
object is to determine the net correlation between corn
yield and the temperature in each of the three months 
for which the average temperature is given. 


DISTINCTION BETWEEN PARTIAL AND SIMPLE 
CORRELATION 


It is important to distinguish between this problem and 
that faced in the ordinary measurement of relationship 
between two variables. We have already secured, as a 
description of the average relationship between corn yield 
and July temperature, the equation 


$$X_1 = 827.64 - 9.302 X_3$$

with

$$S_{13} = 24.73$$

and

$$r_{13} = -.697$$


These measures describe the relationship in question when 
all other factors are ignored. They are not taken account 
of. They are merely neglected. It is as though the chemist, 
in studying the reaction of one element to another, used 
a test tube containing various impurities, which he made 
no attempt to remove. The economist cannot, in general, 
locate and remove all the “impurities” in his problem, but 
he should recognize that his measures relate to such un- 
corrected data. 


THE METHOD OF PARTIAL CORRELATION


In seeking to determine the net correlation between 
corn yield and July temperature we attempt to secure a 
measure of the correlation which would prevail if all other 
factors might be held constant. We shall take full account 
of the other factors we have studied, but we shall try to 
secure a measure which will be influenced only by fluctua- 
tions in July temperature, in relation to corn yield. 




One possible method of accomplishing this end may be 
suggested. If we possessed data covering a very long
period we might be able to pick out a number of years 
during which the average temperatures in June and August 
remained unchanged. Let us say that we could find thirty 
years in all during each of which the June temperature aver- 
aged 74° and the August temperature 78°. Corn yield and 
July temperature varied during these years. The rela- 
tionship between July temperature and corn yield might 
now be measured, and it would be certain that the results 
would not be affected by the presence of fluctuations in 
June temperature and August temperature. Unfortunately, 
this method of holding certain factors constant cannot be 
employed. The data are too limited and too varied, in 
general, to enable us to pick from among them such figures 
as are appropriate to our purpose. Other methods of 
arriving at the same end are available, however. 


THE RELATION BETWEEN THE COEFFICIENTS OF REGRESSION
AND THE COEFFICIENT OF CORRELATION


In introducing this method, let us return for a moment 
to the problem of simple correlation. We have seen that 
the equation which describes the relationship between two 
variables may be expressed in two forms, one with y 
dependent, the other with x dependent. These equations, 
with variables relating to deviations from the respective 
arithmetic means, are of the form 


$$y = r \frac{\sigma_y}{\sigma_x} x$$

and

$$x = r \frac{\sigma_x}{\sigma_y} y$$

In these equations the values $r \frac{\sigma_y}{\sigma_x}$ and $r \frac{\sigma_x}{\sigma_y}$ are termed the
coefficients of regression. They measure the slopes of the
two lines of regression. The coefficient of correlation is
equal to the square root of the product of these two
coefficients of regression. That is

$$r = \sqrt{r \frac{\sigma_y}{\sigma_x} \cdot r \frac{\sigma_x}{\sigma_y}}$$

Using the symbols ordinarily employed for these values
we have

$$r_{yx} = \sqrt{b_{yx} \cdot b_{xy}}$$

where $b_{yx}$ is the coefficient of regression of y on x (y dependent)
and $b_{xy}$ is the coefficient of regression of x on
y (x dependent). If we use the symbols $x_1$ and $x_2$ instead
of y and x, and make corresponding changes in the subscripts,
we have

$$r_{12} = \sqrt{b_{12} \cdot b_{21}}$$

The value of $r$, it may be noted, is the same whether $x_1$
or $x_2$ be dependent. That is, $r_{12} = r_{21}$.
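
The relation between the two coefficients of regression and the coefficient of correlation may be verified numerically. The following is a minimal sketch in Python; the function name and the small data set used with it are hypothetical, chosen only for illustration, and the sign of r is taken from the regression coefficients.

```python
from math import copysign, sqrt


def regression_and_correlation(xs, ys):
    """Return (b_yx, b_xy, r) for paired observations xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations from the respective arithmetic means.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    b_yx = sum_xy / sum_xx  # regression of y on x (y dependent)
    b_xy = sum_xy / sum_yy  # regression of x on y (x dependent)
    # r is the square root of the product, with the sign of the coefficients.
    r = copysign(sqrt(b_yx * b_xy), b_yx)
    return b_yx, b_xy, r


# Hypothetical observations, for illustration only.
b_yx, b_xy, r = regression_and_correlation([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(b_yx, b_xy, r)
```

The value of r so obtained agrees with that given by the usual product-moment formula, since the product of the two slopes is the square of r.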

This leads us at once to the solution of our present 
problem. We seek a coefficient of net correlation between 
two variables, a measure of their relationship when other 
factors are held constant. But in the study of multiple 
correlation we have already derived certain coefficients 
of net regression. The equation of regression, which de- 
scribes the relationship between a single dependent and 
a number of independent variables, is of the form 


$$X_1 = a + b_{12.34} X_2 + b_{13.24} X_3 + b_{14.23} X_4$$


The quantities $b_{12.34}$, $b_{13.24}$ and $b_{14.23}$ are the coefficients of
net regression. They indicate the weight to be attached to
each of the independent variables when an estimate is
based upon the three in combination. Thus, the coefficient
$b_{13.24}$ is a measure of the weight to be attached to the
variable $X_3$ in estimating $X_1$, when variables $X_2$ and $X_4$ are
also taken account of. We may secure the values of these 
three coefficients, as we have seen, by the simple method of 
solving simultaneous equations. 




To get the desired coefficients of net correlation by this
method, however, we need additional values. The coefficient
$r_{12}$ is the square root of the product of $b_{12}$ and $b_{21}$.
The coefficient $r_{13}$ is the square root of the product of
$b_{13}$ and $b_{31}$. Similarly, the coefficient we desire, which
measures the degree of relationship
between $X_1$ (corn yield) and $X_3$ (July temperature) when
$X_2$ (June temperature) and $X_4$ (August temperature) are
held constant, may be derived from the formula:

$$r_{13.24} = \sqrt{b_{13.24} \cdot b_{31.24}}$$

In this equation $b_{13.24}$ is the net regression coefficient of
$X_1$ on $X_3$, and $b_{31.24}$ is the net regression coefficient of
$X_3$ on $X_1$.

The meaning of these symbols has been made clear in 
the course of the discussion. The subscripts indicate the 
particular relationship to which the values apply. The 
two subscripts to the left of the point (the primary subscripts)
refer to the variables to which the measure applies
specifically, while the subscripts to the right of the point
(the secondary subscripts) indicate the variables which
are held constant for the purpose of the particular comparison
being made. The number held constant is two, in
the present case, though it might be one, or any other
number. Thus the general formula for the coefficient of
net correlation between variables $X_1$ and $X_3$ would be
written

$$r_{13.2456 \ldots n} = \sqrt{b_{13.2456 \ldots n} \cdot b_{31.2456 \ldots n}}$$



Of the two primary subscripts the first refers to the de- 
pendent variable, the second to the independent variable. 

To solve for $r_{13.24}$ in the problem before us only one
additional value is needed. We have the value of the
coefficient $b_{13.24}$, having obtained this in the preceding
illustration. The coefficient of regression $b_{31.24}$ is one of
the constants in the equation:

$$X_3 = a + b_{31.24} X_1 + b_{32.14} X_2 + b_{34.12} X_4$$

which is the equation describing the average relationship
between July temperature, as dependent, and June temperature,
August temperature and corn yield as independent
variables. Given certain values of $X_1$, $X_2$ and $X_4$, this
equation enables us to estimate the most probable July 
temperature which would accompany these values. (It 
describes an association, rather than a relation of de- 
pendence, when expressed in this way.) 


SOLUTION OF NORMAL EQUATIONS


The required values are secured just as in the example 
presented above. Four normal equations are secured in 
the usual fashion. These may be reduced to the three 
which follow: 

$$p_{13} = \sigma_1^2 b_{31.24} + p_{12} b_{32.14} + p_{14} b_{34.12}$$
$$p_{23} = p_{12} b_{31.24} + \sigma_2^2 b_{32.14} + p_{24} b_{34.12}$$
$$p_{34} = p_{14} b_{31.24} + p_{24} b_{32.14} + \sigma_4^2 b_{34.12}$$


The mean products and the standard deviations needed
for insertion in these equations have all been used in the
solution of the preceding set, with the single exception of
$\sigma_1^2$. Inserting these values we have

$$-62.039 = 1188.688 b_{31.24} - 47.967 b_{32.14} - 47.519 b_{34.12}$$
$$2.790 = -47.967 b_{31.24} + 8.3564 b_{32.14} + 2.932 b_{34.12}$$
$$2.063 = -47.519 b_{31.24} + 2.932 b_{32.14} + 7.7893 b_{34.12}$$


Solving these equations we secure the following values:

$$b_{31.24} = -.0531$$
$$b_{32.14} = +.0574$$
$$b_{34.12} = -.0807$$
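
The solution of the three normal equations may be checked mechanically. What follows is a minimal sketch in Python, assuming ordinary Gaussian elimination with pivoting (the text does not prescribe a particular method of solution); the function name `solve_linear_system` is our own.

```python
def solve_linear_system(matrix, constants):
    """Solve matrix * x = constants by Gaussian elimination with partial pivoting."""
    n = len(constants)
    # Work on an augmented copy so the inputs are left unchanged.
    m = [row[:] + [rhs] for row, rhs in zip(matrix, constants)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


# Coefficients and constants of the three normal equations given above.
coefficients = [
    [1188.688, -47.967, -47.519],
    [-47.967, 8.3564, 2.932],
    [-47.519, 2.932, 7.7893],
]
constants = [-62.039, 2.790, 2.063]
b31_24, b32_14, b34_12 = solve_linear_system(coefficients, constants)
print(round(b31_24, 4), round(b32_14, 4), round(b34_12, 4))
```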


COMPUTATION OF COEFFICIENT OF NET CORRELATION


From the preceding example we have the value

$$b_{13.24} = -7.394$$

Substituting in the formula

$$r_{13.24} = \sqrt{b_{13.24} \times b_{31.24}}$$

we have

$$r_{13.24} = \sqrt{-7.394 \times -.0531} = -.6266$$


(As in the simpler case in which only two variables are 
dealt with, r takes the sign of the coefficients of regression, 
which will be alike in this respect. If both regression 
coefficients are positive, r is positive; if both are negative,
r is negative.) 

This is the coefficient of net correlation between corn 
yield and July temperature in Kansas. It measures 
the degree of relationship between these two variables 
when June temperature and August temperature are held 
constant. 
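
The computation just made, including the rule of signs, may be sketched as follows. The function name `net_correlation` is hypothetical; the two arguments are the net regression coefficients $b_{13.24}$ and $b_{31.24}$ secured above.

```python
from math import sqrt


def net_correlation(b_forward, b_reverse):
    """Coefficient of net correlation from the two net regression coefficients.

    The result takes the sign shared by the two regression coefficients.
    """
    if b_forward * b_reverse < 0:
        raise ValueError("the two regression coefficients must have the same sign")
    magnitude = sqrt(b_forward * b_reverse)
    return -magnitude if b_forward < 0 else magnitude


# b_13.24 = -7.394 and b_31.24 = -.0531, as found in the text.
r13_24 = net_correlation(-7.394, -0.0531)
print(round(r13_24, 4))
```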


ANOTHER METHOD OF COMPUTING COEFFICIENTS OF
PARTIAL CORRELATION


Obviously a whole series of such coefficients can be 
worked out in dealing with a number of variables. In 
computing a number of such measures a method may be 
utilized which differs somewhat from that employed above, 
and which has certain advantages in the way of systematic 
arrangement. 

A simple coefficient of correlation relating to but two 
variables is termed a coefficient of zero order. Such coefficients
are represented by symbols of the type $r_{12}$, $r_{24}$, etc.
Coefficients of net correlation which relate to two variables,
while a single additional variable is held constant, are
termed coefficients of the first order, and are represented by
symbols such as $r_{12.3}$, $r_{24.3}$, etc. Similarly, we may have
coefficients of the second, third, fourth or nth order, de- 
pending upon the number of variables held constant while 
the relationship between a single dependent and a single 
independent variable is being measured. 

It is possible to derive each coefficient of partial correlation
from those of the next lower order. Thus a coefficient
of the first order may be derived from the relation

$$r_{12.3} = \frac{r_{12} - r_{13} \, r_{23}}{(1 - r_{13}^2)^{1/2} (1 - r_{23}^2)^{1/2}}$$

For a coefficient of the second order

$$r_{12.34} = \frac{r_{12.3} - r_{14.3} \, r_{24.3}}{(1 - r_{14.3}^2)^{1/2} (1 - r_{24.3}^2)^{1/2}}$$

As a general equation for a coefficient of net correlation
of any order, we have

$$r_{12.345 \ldots n} = \frac{r_{12.345 \ldots (n-1)} - r_{1n.345 \ldots (n-1)} \, r_{2n.345 \ldots (n-1)}}{(1 - r_{1n.345 \ldots (n-1)}^2)^{1/2} (1 - r_{2n.345 \ldots (n-1)}^2)^{1/2}}$$


Thus it is possible, starting with the zero order coeffi- 
cients of correlation, to compute all higher order coefficients 
successively. The mere arithmetic of calculation would be 
laborious, but certain prepared tables reduce these computations
to a minimum.¹ The method may be illustrated,
using the data of the preceding problem. 
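
The successive derivation of higher order coefficients from those of the next lower order lends itself to a recursive routine. The following Python sketch is one possible arrangement, not a procedure given in the text: the function name `partial_r` and the dictionary layout are assumptions, and the zero order coefficients are those of the Kansas corn problem (1 = corn yield, 2 = June, 3 = July, 4 = August temperature).

```python
from math import sqrt


def partial_r(zero_order, i, j, held=()):
    """Coefficient of net correlation between i and j with `held` constant.

    zero_order maps frozenset({a, b}) to the simple coefficient r_ab.
    Each call removes the last held variable, following the general equation.
    """
    if not held:
        return zero_order[frozenset((i, j))]
    *rest, k = held
    rest = tuple(rest)
    r_ij = partial_r(zero_order, i, j, rest)
    r_ik = partial_r(zero_order, i, k, rest)
    r_jk = partial_r(zero_order, j, k, rest)
    return (r_ij - r_ik * r_jk) / (sqrt(1 - r_ik**2) * sqrt(1 - r_jk**2))


# Zero order coefficients of the Kansas corn yield and temperature study.
zero_order = {
    frozenset((1, 2)): -0.4814,
    frozenset((1, 3)): -0.6968,
    frozenset((1, 4)): -0.4937,
    frozenset((2, 3)): 0.3737,
    frozenset((2, 4)): 0.3633,
    frozenset((3, 4)): 0.2862,
}

print(partial_r(zero_order, 1, 2, (3,)))    # first order, r_12.3
print(partial_r(zero_order, 1, 3, (2, 4)))  # second order, r_13.24
```

Because no intermediate values are rounded, results from such a routine may differ in the last decimal place from figures worked out with four-place tables.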

In the present case we require three coefficients of the
second order, $r_{12.34}$, $r_{13.24}$ and $r_{14.23}$. These will serve as
measures of the net correlation between corn yield and
temperature in each of the three critical months. The
formula from which the first of these measures may be
computed was given above. For the second, we have

$$r_{13.24} = \frac{r_{13.2} - r_{14.2} \, r_{34.2}}{(1 - r_{14.2}^2)^{1/2} (1 - r_{34.2}^2)^{1/2}}$$

and for the third

$$r_{14.23} = \frac{r_{14.2} - r_{13.2} \, r_{43.2}}{(1 - r_{13.2}^2)^{1/2} (1 - r_{43.2}^2)^{1/2}}$$

But each of these values may be derived from a slightly 
different grouping of first order coefficients. We may use 
the three formulas 


$$r_{12.34} = \frac{r_{12.4} - r_{13.4} \, r_{23.4}}{(1 - r_{13.4}^2)^{1/2} (1 - r_{23.4}^2)^{1/2}}$$

$$r_{13.24} = \frac{r_{13.4} - r_{12.4} \, r_{32.4}}{(1 - r_{12.4}^2)^{1/2} (1 - r_{32.4}^2)^{1/2}}$$

$$r_{14.23} = \frac{r_{14.3} - r_{12.3} \, r_{42.3}}{(1 - r_{12.3}^2)^{1/2} (1 - r_{42.3}^2)^{1/2}}$$


By employing both methods in computing each second
order coefficient a check upon the calculations is afforded.


¹ J. R. Miner, Tables of $\sqrt{1 - r^2}$ and $1 - r^2$ for use in Partial Correlation
and in Trigonometry, Johns Hopkins Press, Baltimore, Md., 1922.


COMPUTATION OF FIRST ORDER COEFFICIENTS


The second order coefficients cannot be computed until 
all necessary first order coefficients have been secured. 
The necessary equations, of the type 


$$r_{12.3} = \frac{r_{12} - r_{13} \, r_{23}}{(1 - r_{13}^2)^{1/2} (1 - r_{23}^2)^{1/2}}$$


may be constructed from the general formula for coeffi- 
cients of partial correlation. Since several of these values 
must be computed, a systematic procedure should be 
adopted. The tabular arrangement on page 510 facilitates 
the computations and lends itself readily to the employ- 
ment of the tables mentioned above. 

The procedure in computing each first order coefficient 
is simple. Three zero order coefficients are necessary for 
each calculation. These should be arranged in the table 
in the order in which they occur in the numerator of the 
fraction from which the required coefficient is to be com- 
puted. The numerator of this fraction is secured by 
subtracting from the first zero order coefficient the product 
of the other two. This product term appears in one column 
of the table. The denominator of the fraction is the
product of two terms of the type $\sqrt{1 - r^2}$, derived from the
second and third coefficients in each group of three. The
tabular arrangement permits these computations to be
carried forward systematically.
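
The steps of this tabular procedure may be set down as a short routine which returns the intermediate columns as well as the final coefficient. This is a sketch only; the function name and the order of the returned values are our own arrangement, and the three zero order coefficients used are those entering into $r_{12.3}$ for the Kansas data.

```python
from math import sqrt


def first_order_row(r_first, r_second, r_third):
    """Work one group of three coefficients as in the tabular procedure.

    Returns (product term, whole numerator, denominator, coefficient).
    """
    product_term = r_second * r_third           # product of the other two
    whole_numerator = r_first - product_term    # first coefficient less product
    denominator = sqrt(1 - r_second**2) * sqrt(1 - r_third**2)
    return product_term, whole_numerator, denominator, whole_numerator / denominator


# Group for r_12.3: r_12 = -.4814, r_13 = -.6968, r_23 = +.3737.
product_term, whole_numerator, denominator, r12_3 = first_order_row(-0.4814, -0.6968, 0.3737)
print(round(product_term, 4), round(whole_numerator, 4), round(denominator, 4), round(r12_3, 4))
```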




TABLE 121 


Illustrating the Computation of First Order Coefficients of Partial 
Correlation 


Kansas Corn Yield and Temperature 


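| Zero order subscript | Zero order coefficient | $(1 - r^2)^{1/2}$ | Product term of numerator | Whole numerator | Denominator | First order subscript | First order coefficient |
|---|---|---|---|---|---|---|---|
| 12 | -.4814 | | -.2604 | -.2210 | .6653 | 12.3 | -.3322 |
| 13 | -.6968 | .7173 | | | | | |
| 23 | +.3737 | .9275 | | | | | |
| 14 | -.4937 | | -.1994 | -.2943 | .6873 | 14.3 | -.4282 |
| 13 | -.6968 | .7173 | | | | | |
| 43 | +.2862 | .9582 | | | | | |
| 24 | +.3633 | | +.1070 | +.2563 | .8887 | 24.3 | +.2884 |
| 23 | +.3737 | .9275 | | | | | |
| 43 | +.2862 | .9582 | | | | | |
| 13 | -.6968 | | -.1799 | -.5169 | .8130 | 13.2 | -.6358 |
| 12 | -.4814 | .8765 | | | | | |
| 32 | +.3737 | .9275 | | | | | |
| 14 | -.4937 | | -.1749 | -.3188 | .8166 | 14.2 | -.3904 |
| 12 | -.4814 | .8765 | | | | | |
| 42 | +.3633 | .9317 | | | | | |
| 34 | +.2862 | | +.1358 | +.1504 | .8642 | 34.2 | +.1740 |
| 32 | +.3737 | .9275 | | | | | |
| 42 | +.3633 | .9317 | | | | | |
| 12 | -.4814 | | -.1794 | -.3020 | .8102 | 12.4 | -.3727 |
| 14 | -.4937 | .8696 | | | | | |
| 24 | +.3633 | .9317 | | | | | |
| 13 | -.6968 | | -.1413 | -.5555 | .8333 | 13.4 | -.6666 |
| 14 | -.4937 | .8696 | | | | | |
| 34 | +.2862 | .9582 | | | | | |
| 23 | +.3737 | | +.1040 | +.2697 | .8928 | 23.4 | +.3021 |
| 24 | +.3633 | .9317 | | | | | |
| 34 | +.2862 | .9582 | | | | | |
| 14 | -.4937 | | -.1994 | -.2943 | .6873 | 14.3 | -.4282 |
| 13 | -.6968 | .7173 | | | | | |
| 43 | +.2862 | .9582 | | | | | |
| 12 | -.4814 | | -.2604 | -.2210 | .6653 | 12.3 | -.3322 |
| 13 | -.6968 | .7173 | | | | | |
| 23 | +.3737 | .9275 | | | | | |
| 42 | +.3633 | | +.1070 | +.2563 | .8887 | 42.3 | +.2884 |
| 43 | +.2862 | .9582 | | | | | |
| 23 | +.3737 | .9275 | | | | | |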




The coefficient $r_{23.4}$ is, of course, identical with $r_{32.4}$;
$r_{34.2}$ is identical with $r_{43.2}$, etc. It is unnecessary to duplicate
the work of computation with respect to these measures. 


COMPUTATION OF SECOND ORDER COEFFICIENTS


From these first order coefficients the three required 
second order coefficients may be secured by methods 
analogous to those employed above. The computations 
are shown in the following table. As a check upon the 
calculations each required measure is computed from two 
different combinations of the first order coefficients. 


TABLE 122 


Illustrating the Computation of Second Order Coefficients of Partial 
Correlation 


Kansas Corn Yield and Temperature 


| First order subscript | First order coefficient | $(1 - r^2)^{1/2}$ | Product term of numerator | Whole numerator | Denominator | Second order subscript | Second order coefficient |
|---|---|---|---|---|---|---|---|
| 12.3 | -.3322 | | -.1235 | -.2087 | .8653 | 12.34 | -.2412 |
| 14.3 | -.4282 | .9037 | | | | | |
| 24.3 | +.2884 | .9575 | | | | | |
| 13.2 | -.6358 | | -.0679 | -.5679 | .9065 | 13.24 | -.6265 |
| 14.2 | -.3904 | .9206 | | | | | |
| 34.2 | +.1740 | .9847 | | | | | |
| 14.2 | -.3904 | | -.1106 | -.2798 | .7601 | 14.23 | -.3681 |
| 13.2 | -.6358 | .7719 | | | | | |
| 43.2 | +.1740 | .9847 | | | | | |
| 12.4 | -.3727 | | -.2014 | -.1713 | .7106 | 12.34 | -.2411 |
| 13.4 | -.6666 | .7454 | | | | | |
| 23.4 | +.3021 | .9533 | | | | | |
| 13.4 | -.6666 | | -.1126 | -.5540 | .8847 | 13.24 | -.6262 |
| 12.4 | -.3727 | .9280 | | | | | |
| 32.4 | +.3021 | .9533 | | | | | |
| 14.3 | -.4282 | | -.0958 | -.3324 | .9031 | 14.23 | -.3681 |
| 12.3 | -.3322 | .9432 | | | | | |
| 42.3 | +.2884 | .9575 | | | | | |




The value of $r_{13.24}$, it will be noted, is the same as that
derived from the two coefficients of net regression. 

The meaning of such coefficients as these was explained 
in the earlier section dealing with this problem. The 
following summary of results may make clear the gain in 
knowledge which has resulted from the above analysis. 


$r_{12} = -.4814$     $r_{12.34} = -.2412$
$r_{13} = -.6968$     $r_{13.24} = -.6265$
$r_{14} = -.4937$     $r_{14.23} = -.3681$


It is clear that the net effect of June temperature upon 
corn yield is distinctly less than was indicated by the 
simple correlation. This is so because there is a positive 
correlation between temperature in June and temperature 
in July and August, so that the crude correlation of two 
variables alone shows June temperature as more important 
than it really is. For the same reason, all the net coeffi- 
cients are less than the simple coefficients, though it is 
still apparent that July temperature is far more important, 
in relation to corn yield, than the temperature in either of 
the other months. 

The coefficients of net correlation are net, of course, only 
with respect to the variables actually taken account of, 
and held constant. Thus there may be other factors, such 
as rainfall in June, July or August, which affect corn yield 
and which are correlated with the temperature during these 
months. Were these included the various coefficients of 
net correlation might have higher or lower values than 
those given. 


A MEASURE OF VARIABILITY


Having these coefficients of net correlation, another 
measure of some importance may be computed. This is 
a measure of the variability of a single character while a 
number of related variables are held constant. Thus the 
question might arise: If we could hold constant the temperature
in Kansas in June, July and August, what would be
the variability of the corn yield? In other words: If we
could eliminate such variability in corn yield as is due to
variability in temperature, what fluctuations would remain
in the yield of corn? This measure of variability is
represented by the symbol $\sigma_{1.234 \ldots n}$. It is termed the
standard deviation of order n.

This measure may be computed from the general equation 


$$\sigma_{1.23 \ldots n}^2 = \sigma_1^2 (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2) \ldots (1 - r_{1n.23 \ldots (n-1)}^2)$$


Applying this formula to the results of the study of corn
yield, we have

$$\sigma_{1.234}^2 = 1188.688 [1 - (-.4814)^2][1 - (-.6358)^2][1 - (-.3681)^2]$$
$$\sigma_{1.234}^2 = 470.3364$$
$$\sigma_{1.234} = 21.69$$


Referring back to the discussion of this problem we find
that the values of $\sigma_{1.234}$ and $S_{1.234}$ are identical. (There is
a difference of .01, as derived above.) That is, the standard
deviation of variable $X_1$, when variables $X_2$, $X_3$ and $X_4$
are held constant, is merely the measure of scatter about
the line of regression. It is the standard error of estimate,
when estimates are based upon the factors $X_2$, $X_3$ and $X_4$.
The reason for this is obvious. The variability of the 
original series is reduced to the extent that estimates 
based upon the equation of relationship approximate the 
actual values. The variability which remains is due to 
differences between these estimates and the actual values. 
But these differences are merely the deviations from the 
line of regression, from which S is computed. A realization 
of the identity of these two measures may assist in making 
their meaning clear. 
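
The chain of multiplications in the general equation is easily mechanized. The following Python sketch, in which the function name `residual_variance` is hypothetical, reproduces the computation for the corn yield problem from the figures given above.

```python
from math import sqrt


def residual_variance(variance, net_coefficients):
    """Variance remaining after each listed coefficient has been allowed for.

    net_coefficients holds r_12, r_13.2, r_14.23, ... in that order.
    """
    remaining = variance
    for r in net_coefficients:
        remaining *= 1 - r**2
    return remaining


# Variance of corn yield and the three coefficients r_12, r_13.2 and r_14.23.
variance_1_234 = residual_variance(1188.688, [-0.4814, -0.6358, -0.3681])
sigma_1_234 = sqrt(variance_1_234)
print(round(variance_1_234, 2), round(sigma_1_234, 2))
```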

Since $\sigma_{1.234}$ and $S_{1.234}$ are identical, the coefficient of
multiple correlation, $R_{1.234}$, may be computed from the
equation

$$R_{1.23 \ldots n}^2 = 1 - \frac{\sigma_{1.23 \ldots n}^2}{\sigma_1^2}$$

or, using the formula for $\sigma_{1.23 \ldots n}^2$, from the equation

$$1 - R_{1.23 \ldots n}^2 = (1 - r_{12}^2)(1 - r_{13.2}^2)(1 - r_{14.23}^2) \ldots (1 - r_{1n.23 \ldots (n-1)}^2)$$


The method first presented, based directly upon the method 
of least squares, furnishes a simpler approach to the prob- 
lem, however, and is in general to be preferred. 
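
For completeness, the second of these equations may be sketched in Python as follows. The function name `multiple_r` is our own; the three coefficients are those of the corn yield problem, and the result should be read as an arithmetic check only, since the text prefers the least squares approach.

```python
from math import sqrt


def multiple_r(net_coefficients):
    """Coefficient of multiple correlation from r_12, r_13.2, r_14.23, ...

    Uses 1 - R^2 = (1 - r_12^2)(1 - r_13.2^2)(1 - r_14.23^2) ...
    """
    unexplained = 1.0
    for r in net_coefficients:
        unexplained *= 1 - r**2
    return sqrt(1 - unexplained)


R_1_234 = multiple_r([-0.4814, -0.6358, -0.3681])
print(round(R_1_234, 4))
```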


CERTAIN LIMITATIONS 


The measures we have described in dealing with problems 
of multiple and partial correlation are valid on the assump- 
tion that the relationship between the different variables 
is in all cases linear. Thus with four variables six different 
pairs may be obtained. The regression in each of these 
six cases should be linear if combined or net effects are to 
be studied by the methods outlined above. If the regres- 
sion is non-linear when natural numbers are dealt with, 
it may be possible to secure linear relationships by corre- 
lating logarithms or reciprocals. Thus we might derive an 
estimating equation of the type 


$$\log X_1 = a + b_{12.34} X_2 + b_{13.24} X_3 + b_{14.23} X_4$$


if the relation between $X_1$ in logarithmic form and each
of the other variables in the original arithmetic form were 
linear. The corresponding measures, S and R, would then 
relate to ratios, as in the examples given in the preceding 
chapter. 

One other important limitation should be noted. Coeffi- 
cients of multiple or of net correlation based upon a large 
number of variables have little significance unless the 
number of observations be large. Misleadingly high values 
will be secured when studies involving many variables 
are based upon small samples. Within the limits set by 
these restrictions, the methods of multiple and partial
correlation constitute very powerful instruments for eco- 
nomic analysis. 




REFERENCES 


Bowley, Arthur L. Elements of Statistics (398-408).

Edgeworth, F. Y. On Correlated Averages. Phil. Mag. 5th
series, Vol. XXXIV, 1892 (194).

Ezekiel, M. J. B. A Method of Handling Curvilinear Correlation
for any Number of Variables. Journal of the American
Statistical Association, Vol. XIX, N. S. 148, 1924.

Haas, G. C. Sale Prices as a Basis for Farm Land Appraisal.
Technical Bulletin No. 9. University of Minnesota Agricultural
Experiment Station.

Kelley, Truman L. Statistical Method (279-310).

Kelley, Truman L. Partial and Multiple Correlation (in
Rietz, H. L. Handbook of Mathematical Statistics, 139-149).

Miner, J. R. Tables of $\sqrt{1 - r^2}$ and $1 - r^2$ for use in Partial
Correlation and Trigonometry.

Pearl, Raymond. Medical Biometry and Statistics (319-331).

Pearson, Karl. Regression, Heredity and Panmixia. Phil.
Transactions Royal Society, Series A. Vol. CLXXXVII,
1896 (253-318).

Smith, Bradford B. The Use of Punched Card Tabulating
Equipment in Multiple Correlation Problems. (Prepared for
the use of statisticians of the Bureau of Agricultural Economics,
U.S. Dept. of Agriculture.)

Tolley, H. R. and Ezekiel, M. J. B. A Method of Handling
Multiple Correlation Problems. Journal of the American
Statistical Association, Dec. 1923.

Yule, G. U. An Introduction to the Theory of Statistics (229-253).
(The notation usually employed in the correlation of several
variables was developed by Yule. It is explained in this
reference.)


CHAPTER XV 


ELEMENTARY PROBABILITIES AND THE NORMAL 
CURVE OF ERROR 


Reference has been made in an earlier section to the 
family resemblance which is found in frequency distribu- 
tions drawn from widely different fields. Attention was 
also drawn to a certain basic type which is represented 
graphically by the symmetrical bell-shaped curve which is 
called the “normal curve,” or the “normal curve of error.”
In an earlier day this curve was looked upon as representing 
a fundamental law which described all distributions of 
quantitative data. From the modern standpoint this was 
quite an erroneous conception. The normal curve is viewed 
today as but one of a number of types of curves which may 
be used to describe frequency distributions. It is, however, 
by far the most important type, and an understanding of 
its characteristics is essential to the statistician. 


ELEMENTARY THEOREMS IN PROBABILITY 


This understanding may best be secured by a brief 
consideration of certain elementary principles of probabil- 
ity. A detailed explanation of the theory of probability 
would carry us far beyond the limits of the present volume. 
The treatment which follows is presented only as an intro- 
duction to the subject, designed to illustrate, by simple 
numerical examples, the relation between the principles of 
probability and the normal law of error. 

In this argument we may use the following standard
notation. If an event can occur in n ways, a of which are
to be considered as successful and b as unsuccessful, the
probability p of a successful outcome may be written

$$p = \frac{a}{n}$$

and the probability q of an unsuccessful outcome may be
written

$$q = \frac{b}{n}$$

Since the sum of the favorable and unfavorable outcomes
is equal to the total number of events, we have

$$a + b = n$$

Dividing by n,

$$\frac{a}{n} + \frac{b}{n} = 1$$

so that

$$p + q = 1$$

or certainty.


A probability, therefore, may be written as a ratio. The 
numerator of the fraction corresponding to this ratio 
represents the number of favorable (or unfavorable) out- 
comes, while the denominator represents the total number 
of possible outcomes. 


EXAMPLES OF SIMPLE PROBABILITIES 


If a coin be tossed, the turning up of a head being looked
upon as a favorable outcome, we have, as the probability
of a success,

$$p = \frac{1}{2}$$

and of a failure,

$$q = \frac{1}{2}$$

If we roll a die, regarding a six spot as a favorable outcome,

$$p = \frac{1}{6}$$

and

$$q = \frac{5}{6}$$

If a card be drawn from a pack of 52 the chance of drawing
the ace of spades is $\frac{1}{52}$; of failing in that endeavor, $\frac{51}{52}$.
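
These simple ratios may be handled exactly with Python's `fractions` module. The sketch below is illustrative only; the helper name `probability` is our own, and it merely restates the three examples just given.

```python
from fractions import Fraction


def probability(favorable, total):
    """Probability as the ratio of favorable outcomes to all possible outcomes."""
    return Fraction(favorable, total)


examples = {
    "head on one toss of a coin": probability(1, 2),
    "six spot on one roll of a die": probability(1, 6),
    "ace of spades in one draw": probability(1, 52),
}

for description, p in examples.items():
    q = 1 - p  # probability of failure, since p + q = 1
    print(description, p, q)
```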


THE ADDITION OF PROBABILITIES


What is the chance of securing either an ace of spades
or a two of spades in a single draw from a pack of 52 cards?
In such a case, where any one of several mutually exclusive
outcomes will be considered as favorable, the probability of
a success is the sum of the separate probabilities. In this example

$$p = \frac{1}{52} + \frac{1}{52} = \frac{1}{26}$$

The chance of drawing either a heart or a spade from a
pack of playing cards is given by

$$p = \frac{13}{52} + \frac{13}{52} = \frac{1}{2}$$

THE MULTIPLICATION OF PROBABILITIES


Two events are said to be independent when the outcome 
of one does not affect the outcome of the other. Thus the 
result of one throw of a die does not, presumably, affect the 
result of the next toss. The probability of a compound 
event (i.e. that two events, independent of one another,
will both occur) is the product of the probabilities of the
separate events. Thus the chance of securing an ace,
followed by a two spot, in two successive throws of a die,
is given by

$$p = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$$
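
The two rules may be sketched together in Python. The function names are hypothetical, and the sketch assumes, as the rules themselves require, that the outcomes added are mutually exclusive and that the events multiplied are independent.

```python
from fractions import Fraction
from functools import reduce


def add_exclusive(*probabilities):
    """Probability that any one of several mutually exclusive outcomes occurs."""
    return sum(probabilities, Fraction(0))


def multiply_independent(*probabilities):
    """Probability that all of several independent events occur."""
    return reduce(lambda a, b: a * b, probabilities, Fraction(1))


# Ace of spades or two of spades in a single draw.
ace_or_two_of_spades = add_exclusive(Fraction(1, 52), Fraction(1, 52))
# A heart or a spade in a single draw.
heart_or_spade = add_exclusive(Fraction(13, 52), Fraction(13, 52))
# An ace followed by a two spot in two successive throws of a die.
ace_then_two = multiply_independent(Fraction(1, 6), Fraction(1, 6))

print(ace_or_two_of_spades, heart_or_spade, ace_then_two)
```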

In computing the probability of a given outcome, it is
frequently necessary both to multiply and to add probabilities.
For example, we wish to determine the chance of
securing the total 5 from two dice thrown simultaneously. 
We may label the dice a and b to distinguish them. This 
total may be secured from any one of the four following 
combinations 


Die a Die b
1 4 
2 3 
3 2 
4 1 


The chance of securing an ace with die a is $\frac{1}{6}$; of securing
a 4 with die b is $\frac{1}{6}$. The chance of the two in combination
is $\frac{1}{36}$. Similarly, the probability of each of the other three
combinations is $\frac{1}{36}$. But any one of these four results will
give a total of 5, and will be considered successful. Hence

$$p = \frac{1}{36} + \frac{1}{36} + \frac{1}{36} + \frac{1}{36} = \frac{1}{9}$$

We have in this example answered the question: What 
is the probability of securing exactly 5 in the toss of two 
dice? We might put the question: What is the chance of 
securing at least 5 in the toss of two dice? In this case a 
total of 5 or more will be considered a favorable outcome. 
Just as in the preceding example, we may work out the 
probability of securing each of the results which will be 
accepted as successful. The following summary indicates 

the probability of each of these totals: 


Probability of throwing 12 with two dice = $\frac{1}{36}$
Probability of throwing 11 with two dice = $\frac{2}{36}$
Probability of throwing 10 with two dice = $\frac{3}{36}$
Probability of throwing 9 with two dice = $\frac{4}{36}$
Probability of throwing 8 with two dice = $\frac{5}{36}$
Probability of throwing 7 with two dice = $\frac{6}{36}$
Probability of throwing 6 with two dice = $\frac{5}{36}$
Probability of throwing 5 with two dice = $\frac{4}{36}$

Sum of above probabilities = $\frac{30}{36}$


The chance of throwing at least 5 in the toss of two dice is,
therefore, $\frac{30}{36}$ or $\frac{5}{6}$.
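
Both results may be confirmed by direct enumeration of the 36 equally likely combinations. The following Python sketch is offered as an illustration; the function name `total_distribution` is our own.

```python
from fractions import Fraction
from itertools import product


def total_distribution(faces=6):
    """Probability of each total obtainable with two dice of `faces` sides."""
    counts = {}
    for a, b in product(range(1, faces + 1), repeat=2):
        counts[a + b] = counts.get(a + b, 0) + 1
    outcomes = faces * faces
    return {total: Fraction(count, outcomes) for total, count in counts.items()}


distribution = total_distribution()
p_exactly_5 = distribution[5]
p_at_least_5 = sum(p for total, p in distribution.items() if total >= 5)
print(p_exactly_5, p_at_least_5)
```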


THE BINOMIAL EXPANSION AND THE MEASUREMENT OF
PROBABILITIES


It is possible to express these facts in a generalized form. 
A simple illustration may be employed to exemplify the 
derivation of the general expression. 

If two coins are tossed simultaneously there are four
possible outcomes

a b    a b    a b    a b
T T    T H    H T    H H

(The two coins are represented, respectively, by the letters
a and b.) The chances of securing no heads, one head and
two heads are, respectively, $\frac{1}{4}$, $\frac{2}{4}$ and $\frac{1}{4}$. If three coins
(represented by the letters a, b and c) are tossed simultaneously,
we have eight possible outcomes

a b c    a b c    a b c    a b c    a b c    a b c    a b c    a b c
T T T    T T H    T H T    T H H    H T T    H T H    H H T    H H H

The chances of securing no heads, 1 head, 2 heads and
3 heads are, respectively, $\frac{1}{8}$, $\frac{3}{8}$, $\frac{3}{8}$ and $\frac{1}{8}$.

But these results may be derived without the necessity 
of working out the separate probabilities in detail. We have 
employed p and q to represent, respectively, the probability 
of success and failure of a given event. If there are two 
independent events, the probabilities being the same in 
each case, the compound probabilities are given by the 
expansion of the expression 


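$$(p + q)^2$$


In the present case p (the probability of throwing a head)
= q = $\frac{1}{2}$, and the probabilities of the various results are
given by

$$\left(\frac{1}{2} + \frac{1}{2}\right)^2 = \frac{1}{4} + \frac{2}{4} + \frac{1}{4}$$

These are the results secured in the first example. If
there are three independent events we have

$$\left(\frac{1}{2} + \frac{1}{2}\right)^3 = \frac{1}{8} + \frac{3}{8} + \frac{3}{8} + \frac{1}{8}$$
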
the probabilities secured in the second example. 

If we wish to know not the separate probabilities but the 
probable frequencies of the various outcomes in a given 
number of trials, these may be computed from the ex- 
pression 

$$N(p + q)^n$$

where N represents the number of trials and n the number
of independent events. Thus if there are 200 trials and
there are two independent events, the probable frequencies
are given by

$$200(p + q)^2 = 200(p^2 + 2pq + q^2).$$

With p = q = $\frac{1}{2}$ this gives us

$$200\left(\frac{1}{4}\right) + 200\left(\frac{2}{4}\right) + 200\left(\frac{1}{4}\right) = 50 + 100 + 50$$

which indicates the probable frequencies of 2 successes,
1 success and no successes. 

If there are three independent events, the probable 
frequencies in N trials are determined from the binomial 


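expansion of

$$N(p + q)^3$$

If N equals 200, we have

$$200(p^3 + 3p^2 q + 3pq^2 + q^3)$$

If p equals $\frac{1}{2}$ we have

$$200\left(\frac{1}{8}\right) + 200\left(\frac{3}{8}\right) + 200\left(\frac{3}{8}\right) + 200\left(\frac{1}{8}\right)$$
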
These terms indicate, in order, the probable frequencies 
of 3 successes, 2 successes, 1 success and no successes. 
The total frequencies secured by carrying through the 
process of multiplication will be equal to the number of 
trials, for all possible outcomes are covered by the ex- 
pansion. 

Thus, when we know the probabilities in advance,¹ we
may determine the probable frequencies of any given 
number of successes or failures, a fact of great significance 
in the development of statistical theory. 
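The computation of probable frequencies from $N(p + q)^n$ may be sketched in a few lines of Python. The function name `probable_frequencies` is an assumption made for this illustration; the terms are returned in the order used above, the greatest number of successes first.

```python
from math import comb


def probable_frequencies(N, n, p=0.5):
    """Terms of N(p + q)**n in the text's order: n successes first, no successes last."""
    q = 1.0 - p
    return [N * comb(n, k) * p ** k * q ** (n - k) for k in range(n, -1, -1)]


two = probable_frequencies(200, 2)     # 2, 1 and 0 successes
three = probable_frequencies(200, 3)   # 3, 2, 1 and 0 successes
print(two)
print(three, sum(three))
```

The sum of the terms equals the number of trials, since the expansion covers all possible outcomes.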


A COMPARISON OF ACTUAL AND THEORETICAL FREQUENCIES
IN THE REALM OF PURE CHANCE 


Certain points of importance may be made clear by 
comparing some experimental results with the theoretical 


1 A distinction is generally drawn between a priori probabilities of the type 
described above, and empirical probabilities, knowledge of which is derived from 
observation or experience. As an example of the latter type we have, as the
probability that a man aged 35 will live 10 years, the ratio $\tfrac{74173}{81822}$. This is based


upon the American Experience Table of Mortality which shows that of 81,822 
men living at the age of 35, there are 74,173 living ten years later. Chapter 
XVI is devoted to certain considerations relating to the use of such empirical 
measures. 




frequencies given by the binomial expansion. Twelve dice 
were thrown a number of times. Each 4, 5 or 6 spot ap- 
pearing was considered to be a success while a 1, 2 or 3 spot 
was a failure. (In a typical throw we might have the fol- 
lowing spots up: 3, 1, 5, 1, 2, 4, 4, 6, 3, 2, 3,5. In this lot 
there are five successes, and the result is so tallied.) 
In a classical example recorded by W. F. R. Weldon¹ twelve
dice were thrown in this way 4096 times, a success being
defined as above. The results are recorded in column (2)
of Table 123, and the distribution is shown in Fig. 85. By
computation we find the arithmetic mean and the standard
deviation of this distribution to be, respectively, 6.139 and
1.712.

Let us compare with these results those which we might 
expect from the given conditions. Twelve dice were thrown 
each time, hence we are dealing with 12 independent 
events. There were 4096 trials. Since either a 4, 5 or 6 is
considered a success, $p = q = \tfrac{1}{2}$.


For the terms in the binomial expansion we have 


$$(a + b)^n = a^n + na^{n-1}b + \frac{n(n-1)}{1 \cdot 2}a^{n-2}b^2 + \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3}a^{n-3}b^3 + \cdots$$


In the present case we have 


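$$4096\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^{12}$$

Expanding,

$$4096\left(\tfrac{1}{4096} + \tfrac{12}{4096} + \tfrac{66}{4096} + \tfrac{220}{4096} + \tfrac{495}{4096} + \tfrac{792}{4096} + \tfrac{924}{4096} + \tfrac{792}{4096} + \tfrac{495}{4096} + \tfrac{220}{4096} + \tfrac{66}{4096} + \tfrac{12}{4096} + \tfrac{1}{4096}\right)$$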


Completing the indicated multiplication we have the 
theoretical frequencies of the various possible successes in 


1 Cited by G. U. Yule, An Introduction to the Theory of Statistics, 5th ed., 258, 
and by F. Y. Edgeworth, Encycl. Brit., 11th ed., Vol. XXII, 394. 




4096 throws of twelve dice. These are shown in column (3)
of the following table. 


TABLE 123
Comparison of Actual and Theoretical Frequencies in Dice-Rolling
Experiment

   (1)            (2)            (3)
Number of      Observed      Theoretical
Successes     Frequencies    Frequencies

    0               0              1
    1               7             12
    2              60             66
    3             198            220
    4             430            495
    5             731            792
    6             948            924
    7             847            792
    8             536            495
    9             257            220
   10              71             66
   11              11             12
   12               0              1

 Total           4096           4096


The distribution of the theoretical frequencies is shown 
in Fig. 85. 
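Column (3) of Table 123 may be reproduced directly. The sketch below, in Python, assumes nothing beyond the conditions stated above (4096 trials, twelve independent events, p = q = 1/2); the variable names are illustrative only.

```python
from math import comb

N_THROWS = 4096   # number of trials
N_DICE = 12       # independent events in each trial

# With p = q = 1/2, each term of 4096 * (1/2 + 1/2)**12 is a binomial
# coefficient multiplied by 4096 / 2**12, that is, by 1.
theoretical = [N_THROWS * comb(N_DICE, k) // 2 ** N_DICE for k in range(N_DICE + 1)]

print(theoretical)
print(sum(theoretical))
```

Because 4096 is exactly $2^{12}$, the theoretical frequencies in this example are whole numbers.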

The advantages to be derived from being able to secure 
a standard type of distribution with which to compare 
the actual distributions found by experiment are many. We 
are able, in the first place, to determine the degree to which 
differences in the frequencies are due to chance fluctuations 
of sampling (or to a bias in the dice). We can determine 
what the true distribution would be if the test were con- 
tinued until the results of an infinite number of throws 
were recorded. The results of the experiment, standing 
alone, are limited in their significance to the particular 
throws recorded. The theoretical frequencies have no such 
limitations, applying generally where the same initial
probabilities apply. This fact, that a knowledge of the 




theoretical frequencies affords a basis for generalization such 
as is not possible from the results of experiment alone, is 
the central and most significant of these advantages. 


[Figure: observed and theoretical frequencies (vertical axis, Frequency) plotted against Number of Successes (horizontal axis).]

Fig. 85. A Comparison of Actual and Theoretical Frequencies in a Dice-rolling Experiment


When we have, as in this case, a knowledge of the 
probabilities involved, it is possible to determine the arith- 
metic mean and the standard deviation of the distribution 
which will be formed by the theoretical frequencies. As a 
general expression for the mean number of successes, where 
the number of independent events and the probability of 
success are known, we have 


$$M = np$$


Applying the present values,

$$M = 12 \times \tfrac{1}{2} = 6$$


The mean, as computed from the observed frequencies, is 
6.139. 

As a general expression for the standard deviation, under 
the same conditions, we have 


$$\sigma = \sqrt{npq}$$


In the present case 


$$\sigma = \sqrt{12 \times \tfrac{1}{2} \times \tfrac{1}{2}} = \sqrt{3} = 1.732$$


The standard deviation, as computed from the actual 
frequencies, is 1.712. 
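The comparison of theoretical and observed constants may be verified with the short Python sketch below. The observed frequencies are entered as read from column (2) of Table 123 (the entries 7, 71 and 11 are Weldon's figures and should be checked against the printed table); the helper name `mean_and_sd` is an assumption of this illustration.

```python
from math import sqrt

# Observed frequencies of 0, 1, ..., 12 successes in 4096 throws of twelve dice.
observed = [0, 7, 60, 198, 430, 731, 948, 847, 536, 257, 71, 11, 0]


def mean_and_sd(freqs):
    """Arithmetic mean and standard deviation of a distribution of success counts."""
    n = sum(freqs)
    mean = sum(k * f for k, f in enumerate(freqs)) / n
    variance = sum(f * (k - mean) ** 2 for k, f in enumerate(freqs)) / n
    return mean, sqrt(variance)


n_events, p = 12, 0.5
theoretical_mean = n_events * p                  # M = np
theoretical_sd = sqrt(n_events * p * (1 - p))    # sigma = sqrt(npq)
observed_mean, observed_sd = mean_and_sd(observed)

print(theoretical_mean, round(theoretical_sd, 3))
print(round(observed_mean, 3), round(observed_sd, 3))
```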


THE NORMAL CURVE OF ERROR


We may return to a consideration of the curve in Fig. 85 
which represents the theoretical frequencies in the dice 
throwing experiments. It is a perfectly symmetrical 12-sided
polygon, the number of sides (excluding the base)
corresponding to the number of independent events in
the particular problem considered. With six events we
should have a six-sided figure, with twenty events a twenty-sided
figure, and so on. It is obvious that, as n increases,
the number of sides to the polygon increasing correspondingly
in number, the graph representing the expansion of
the binomial $(p + q)^n$ approaches more and more closely
a smooth curve. With n infinitely large a perfectly smooth
curve would be secured. This is the normal curve of error
which has been plotted in Fig. 87.

A great part of the power which modern statistical 
technique possesses is derived from the detailed knowledge 
of the characteristics of this normal or Gaussian curve. 
From prepared tables of the integrals of this curve theo- 
retical frequencies may be determined much more readily 
than by the laborious method based upon the binomial 




expansion. Several examples demonstrating the practical 
utility of such tables are presented below. 

The equation to this curve is written in several forms, of
which

$$y = y_0 e^{-\frac{x^2}{2\sigma^2}}$$

is a common type. In this equation $y_0$, the maximum ordinate,
is a constant; e is a constant (the base of the Napierian
logarithms) having a value of 2.7182818, while $\sigma$ represents
the standard deviation.¹ The maximum ordinate may be
derived from the relation

$$y_0 = \frac{N}{\sigma\sqrt{2\pi}}$$

hence the equation to the normal curve may be written

$$y = \frac{N}{\sigma\sqrt{2\pi}} e^{-\frac{x^2}{2\sigma^2}}$$

The uses of this curve, and of the tables of the probability 
integral based upon the integration of this curve, are far 
too varied to be enumerated at length here. A simple 
example may serve to introduce the subject. 
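Before turning to that example, the approach of the binomial polygon to the normal curve may be illustrated with the dice experiment. The Python sketch below is a rough comparison only; the function name `normal_ordinate` is an assumption, and x is measured from the mean in the same units as $\sigma$.

```python
from math import comb, exp, pi, sqrt


def normal_ordinate(x, N, sigma):
    """y = N / (sigma * sqrt(2*pi)) * exp(-x**2 / (2 * sigma**2)); x is measured from the mean."""
    return N / (sigma * sqrt(2 * pi)) * exp(-x ** 2 / (2 * sigma ** 2))


N, n, p = 4096, 12, 0.5
mean = n * p
sigma = sqrt(n * p * (1 - p))

# Binomial frequency beside the normal ordinate for each number of successes.
for k in range(n + 1):
    print(k, comb(n, k), round(normal_ordinate(k - mean, N, sigma), 1))
```

Even with n as small as 12 the two columns lie close together near the mean; the agreement is poorer, in relative terms, in the tails.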


AN ECONOMIC APPLICATION


The statistical division of the American Telephone and 
Telegraph Company has made a study of the annual 
message use of four-party line residence message rate 
subscribers in Buffalo. The annual messages for each of 
995 subscribers were tabulated and classified.² The results,
together with certain computations, appear in the following 
table. 

1 Gauss’ deduction of the error equation may be found in all standard 
works on the theory of least squares. Cf. references at end of Appendix A. 

2 "Introduction to Frequency Curves and Averages." Statistical Bulletin, Statistical
Methods Series, No. 1. Issued by chief statistician, American Telephone
and Telegraph Co. 




TABLE 124


Annual Message Use of 995 Telephone Subscribers
Illustrating the Computation of the Moments of a Frequency Distribution

(1) Interval of message use¹
(2) Mid-point, m
(3) Frequency, f
(4) Deviation from arbitrary origin in class-interval units, x'
(5) to (8) Products fx', f(x')^2, f(x')^3 and f(x')^4

    (1)       (2)    (3)    (4)     (5)      (6)       (7)        (8)
  Interval     m      f      x'     fx'    f(x')^2   f(x')^3    f(x')^4

    0-  50     25      0    -10       0        0         0          0
   50- 100     75      1     -9      -9       81      -729       6561
  100- 150    125      9     -8     -72      576     -4608      36864
  150- 200    175     19     -7    -133      931     -6517      45619
  200- 250    225     38     -6    -228     1368     -8208      49248
  250- 300    275     50     -5    -250     1250     -6250      31250
  300- 350    325     95     -4    -380     1520     -6080      24320
  350- 400    375     85     -3    -255      765     -2295       6885
  400- 450    425    115     -2    -230      460      -920       1840
  450- 500    475    132     -1    -132      132      -132        132
  500- 550    525    144      0       0        0         0          0
  550- 600    575    116      1     116      116       116        116
  600- 650    625     79      2     158      316       632       1264
  650- 700    675     54      3     162      486      1458       4374
  700- 750    725     31      4     124      496      1984       7936
  750- 800    775     11      5      55      275      1375       6875
  800- 850    825      5      6      30      180      1080       6480
  850- 900    875      6      7      42      294      2058      14406
  900- 950    925      2      8      16      128      1024       8192
  950-1000    975      1      9       9       81       729       6561
 1000-1050   1025      1     10      10      100      1000      10000
 1050-1100   1075      1     11      11      121      1331      14641

  Total              995           -956     9676    -22952     283564


1 As here classified an item having a value of 50 was put in the class having 50 
as an upper limit. Items falling on other class limits were similarly disposed of. 


THE MOMENTS OF A FREQUENCY DISTRIBUTION


Certain terms and symbols which have not been em- 
ployed heretofore may be introduced at this point. We 
may write 




$\nu_1 = \dfrac{\sum f(x')}{N}$ = first moment of the distribution about the arbitrary origin.

$\nu_2 = \dfrac{\sum f(x')^2}{N}$ = second moment of the distribution about the arbitrary origin.

$\nu_3 = \dfrac{\sum f(x')^3}{N}$ = third moment of the distribution about the arbitrary origin.

$\nu_4 = \dfrac{\sum f(x')^4}{N}$ = fourth moment of the distribution about the arbitrary origin.


“Moment” is a familiar mechanical term for the measure 
of a force with respect to its tendency to produce rotation. 
The strength of this tendency depends, obviously, upon the 
distance of the point at which the force is exerted from the 
origin. The term is used in statistics in a quite analogous 
sense, the class frequencies being looked upon as the forces 
in question. The distance of each class mid-point from the 
origin is the factor of prime importance in this respect. 
The moments of a distribution about any origin may be
computed by multiplying the frequency of each class by
a given power of its distance, along the x-axis, from the
origin, summing the resulting products and dividing by
the number of cases. If the first moment is desired, the
first power of the x-distance is employed; if the fourth
moment, the fourth power of the x-distance, etc. The
subscripts indicate the moments represented by the various
symbols.
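The rule just stated may be applied to the telephone data in a few lines. In the Python sketch below the frequencies and deviations are those of columns (3) and (4) of Table 124; the function name `moment_sums` is hypothetical.

```python
# Class frequencies for mid-points 25, 75, ..., 1075.
freq = [0, 1, 9, 19, 38, 50, 95, 85, 115, 132, 144,
        116, 79, 54, 31, 11, 5, 6, 2, 1, 1, 1]
# Deviations x' from the arbitrary origin (mid-point 525), in class-interval units.
dev = list(range(-10, 12))


def moment_sums(freq, dev, max_power=4):
    """Sums of f * x'**r for r = 1, ..., max_power."""
    return [sum(f * x ** r for f, x in zip(freq, dev)) for r in range(1, max_power + 1)]


N = sum(freq)
sums = moment_sums(freq, dev)
nu = [s / N for s in sums]   # moments about the arbitrary origin

print(N, sums)
print([round(v, 6) for v in nu])
```

The four sums correspond to the totals of columns (5) to (8) of the table.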

The most significant moments, for statistical purposes, 
are those which relate to the arithmetic mean as origin. 
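Representing these moments by $\pi$, we have the following
relationships:


First moment about the mean = $\pi_1 = 0$

Second moment about the mean = $\pi_2 = \nu_2 - \nu_1^2$

Third moment about the mean = $\pi_3 = \nu_3 - 3\nu_1\nu_2 + 2\nu_1^3$

Fourth moment about the mean = $\pi_4 = \nu_4 - 4\nu_1\nu_3 + 6\nu_1^2\nu_2 - 3\nu_1^4$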




The computation of these moments from the data, as 
classified, involves the assumption that the items in each 
class can be treated as though they were concentrated at 
the mid-point of that class. It has been established that, 
under certain conditions, calculations made on this assump- 
tion are subject to a constant error. In particular, it has 
been shown that the values of the second and fourth 
moments are not the same, when computed from grouped 
data, as when computed from ungrouped data. 

W. F. Sheppard¹ has worked out certain corrections for
this bias. His corrections may be applied when two 
conditions prevail: 

(1) When the distribution relates to a continuous variable. 

(2) When the frequency curve is characterized by "high
contact", i.e., when the frequency curve tapers off gradually
in both directions.

The symbol $\mu$ is employed to represent a corrected
moment about the mean. The application of Sheppard's
corrections gives us the following final formulation:

$$\mu_1 = 0$$

$$\mu_2 = \pi_2 - \tfrac{1}{12}$$

$$\mu_3 = \pi_3$$

$$\mu_4 = \pi_4 - \tfrac{1}{2}\pi_2 + \tfrac{7}{240}$$


(In applying the corrections 1/12 and 7/240, the correspond- 
ing decimal values, .083333 and .029167, will generally 
be employed.) It is assumed in making these corrections 
that a class-interval unit has been employed in measuring 
deviations from the mean. 

It may be noted in passing that the standard deviation is 
the square root of the second moment about the mean. For 
the uncorrected value, 


$$\sigma = \sqrt{\pi_2}$$


1 Cf. Proceedings of the London Mathematical Society, Vol. XXIX, 353-380. 




If Sheppard's corrections are to be applied

$$\sigma = \sqrt{\mu_2}$$


The calculation of the moments of the frequency dis- 
tribution of telephone subscribers is shown below. Shep- 
pard’s corrections are applied, since the curve is marked 
by reasonably high contact. It is a discontinuous distri- 
bution, but the unit (1) is so small in comparison with the 
range that it may be treated as continuous. 


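$$\nu_1 = \frac{-956}{995} = -.960804$$

$$\nu_2 = \frac{9676}{995} = 9.724623$$

$$\nu_3 = \frac{-22952}{995} = -23.067337$$

$$\nu_4 = \frac{283564}{995} = 284.988945$$

$$\pi_1 = 0$$

$$\pi_2 = \nu_2 - \nu_1^2 = 9.724623 - .923144 = 8.801479$$

$$\pi_3 = \nu_3 - 3\nu_1\nu_2 + 2\nu_1^3 = -23.067337 + 28.030370 - 1.773922 = 3.189111$$

$$\pi_4 = \nu_4 - 4\nu_1\nu_3 + 6\nu_1^2\nu_2 - 3\nu_1^4 = 284.988945 - 88.652760 + 53.863384 - 2.556586 = 247.642983$$

$$\mu_1 = 0$$

$$\mu_2 = \pi_2 - \tfrac{1}{12} = 8.801479 - .083333 = 8.718146$$

$$\mu_3 = \pi_3 = 3.189111$$

$$\mu_4 = \pi_4 - \tfrac{1}{2}\pi_2 + \tfrac{7}{240} = 247.642983 - 4.400739 + .029167 = 243.271411$$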
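The passage from moments about the arbitrary origin to corrected moments about the mean may be checked as follows. This Python sketch simply transcribes the relationships given above; the function names `central_moments` and `sheppard` are assumptions, and the sums are the column totals of Table 124.

```python
def central_moments(nu1, nu2, nu3, nu4):
    """Moments about the mean (pi_1 .. pi_4) from moments about an arbitrary origin."""
    pi2 = nu2 - nu1 ** 2
    pi3 = nu3 - 3 * nu1 * nu2 + 2 * nu1 ** 3
    pi4 = nu4 - 4 * nu1 * nu3 + 6 * nu1 ** 2 * nu2 - 3 * nu1 ** 4
    return 0.0, pi2, pi3, pi4


def sheppard(pi2, pi3, pi4):
    """Sheppard's corrections, deviations being measured in class-interval units."""
    return 0.0, pi2 - 1 / 12, pi3, pi4 - pi2 / 2 + 7 / 240


N = 995
nu = [s / N for s in (-956, 9676, -22952, 283564)]
_, pi2, pi3, pi4 = central_moments(*nu)
_, mu2, mu3, mu4 = sheppard(pi2, pi3, pi4)

print(round(pi2, 6), round(pi3, 6), round(pi4, 6))
print(round(mu2, 6), round(mu3, 6), round(mu4, 6))
```

Small differences in the last decimal place may appear, since the printed figures were obtained from intermediate values rounded to six places.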


CRITERIA OF CURVE TYPE


Having these values, we may return to a consideration 
of the main problem, the utilization of our knowledge of 
the normal curve. There are certain criteria which enable 




us to determine readily whether a given distribution may 
be described by a curve of the normal type. These may be 
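derived from the corrected moments of the given distribution.

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{10.170429}{662.632015} = .01534853$$

$$\beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{243.271411}{76.006070} = 3.200683$$

$$\kappa_2 = \frac{\beta_1(\beta_2 + 3)^2}{4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)} = \frac{.01534853 \times 38.448470}{4(12.756686)(.355320)} = \frac{.5901275}{18.130823}$$

$$\kappa_2 = .032548$$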


For the normal curve these criteria have the following 
values 


$$\beta_1 = 0$$
$$\beta_2 = 3$$
$$\kappa_2 = 0$$


We may conclude, tentatively, that the normal curve 
may be used to describe the given distribution.¹
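The three criteria may be computed from the corrected moments with the following Python sketch. The function name `curve_criteria` is an assumption made for illustration; the moments supplied are those found above for the telephone data.

```python
def curve_criteria(mu2, mu3, mu4):
    """beta_1, beta_2 and kappa_2 from the corrected moments about the mean."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    kappa2 = beta1 * (beta2 + 3) ** 2 / (
        4 * (4 * beta2 - 3 * beta1) * (2 * beta2 - 3 * beta1 - 6)
    )
    return beta1, beta2, kappa2


beta1, beta2, kappa2 = curve_criteria(8.718146, 3.189111, 243.271411)
print(round(beta1, 6), round(beta2, 4), round(kappa2, 6))
```

It should be observed that for an exactly normal distribution the last factor of the denominator vanishes together with the numerator, so the sketch is meant for observed distributions, where $\beta_1$ and $\beta_2$ do not take their normal values exactly.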


FITTING A NORMAL CURVE; USE OF TABLE OF
ORDINATES


The process of fitting the normal curve has been simpli- 
fied by the preparation of tables which express ordinates 
of such a curve as fractional parts of the maximum ordinate, 
$y_0$. The following very short table will serve to indicate the
nature of the procedure, though it is not detailed enough
to employ in the actual computations.²

1 It is possible that a better fit would be obtained with a Pearsonian Type IV
curve, but, as will be seen, a reasonably good fit is secured with the normal curve, 
and this curve has an additional advantage in the matter of simplicity. 


2 A comprehensive table of ordinates of the normal curve, in terms of 
abscissa, will be found in Pearson’s Tables for Statisticians and Biometricians. 




TABLE 125
Ordinates of the Normal Curve

Expressed as Fractional Parts of the Maximum Ordinate

 $x/\sigma$    $y/y_0$       $x/\sigma$    $y/y_0$
  0.0    1.00000       2.5     .04394
  0.1     .99501       2.6     .03405
  0.2     .98020       2.7     .02612
  0.3     .95600       2.8     .01984
  0.4     .92312       2.9     .01492
  0.5     .88250       3.0     .01111
  0.6     .83527       3.1     .00819
  0.7     .78270       3.2     .00598
  0.8     .72615       3.3     .00432
  0.9     .66698       3.4     .00309
  1.0     .60653       3.5     .00219
  1.1     .54607       3.6     .00153
  1.2     .48675       3.7     .00106
  1.3     .42956       3.8     .00073
  1.4     .37531       3.9     .00050
  1.5     .32465       4.0     .00034
  1.6     .27804       4.1     .00022
  1.7     .23575       4.2     .00015
  1.8     .19790       4.3     .00010
  1.9     .16448       4.4     .00006
  2.0     .13534       4.5     .00004
  2.1     .11025       4.6     .00003
  2.2     .08892       4.7     .00002
  2.3     .07100       4.8     .00001
  2.4     .05614       4.9     .00001
                       5.0     .00000


In the column $x/\sigma$ deviations from the mean are expressed
in units of the standard deviation. In the column $y/y_0$
ordinates of the normal curve at given distances from the
mean are given as fractional parts of the maximum ordinate.
Thus the ordinate erected at a distance of one standard
deviation from the mean is equal to $.60653y_0$ while an
ordinate distant from the mean by $2.8\sigma$ is equal to $.01984y_0$.




In fitting a normal curve to a given distribution, therefore, 
it is necessary, first, to express the deviations from the 
mean in terms of $\sigma$; second, to determine the value of the
maximum ordinate, and finally, using such a table as the 
one above, to compute all other necessary ordinates. The 
value of the maximum ordinate is secured from the relation 


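$$y_0 = \frac{N}{\sigma\sqrt{2\pi}}$$

or, substituting the value of $\sqrt{2\pi}$, from

$$y_0 = \frac{N}{2.506628\,\sigma}$$

In the present problem

$$N = 995$$

and

$$\sigma = \sqrt{\mu_2} = \sqrt{8.718146} = 2.953$$

Hence

$$y_0 = \frac{995}{2.506628 \times 2.953} = 134.42$$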


The value of the arithmetic mean is 476.96. 

The subsequent calculations appear in Table 126.

In the application of the formula for $y_0$ and in carrying
through the calculations shown in Table 126, the standard 
deviation should be expressed in class-interval units. This 
is necessary in order that the computed frequencies may be 
comparable with the actual frequencies, as tabulated by 
classes. 
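The fitting by ordinates may be sketched in Python as follows. The constants are those stated in the text (N = 995, mean 476.96, $\sigma$ = 2.953 class intervals of 50 messages each); the function name `ordinate` is an assumption, and the exponential is evaluated directly rather than read from a table of ordinates.

```python
from math import exp, pi, sqrt

N = 995
mean = 476.96          # arithmetic mean, in messages
sigma = 2.953          # standard deviation, in class-interval units
width = 50             # class interval, in messages
midpoints = range(25, 1100, 50)   # class mid-points 25, 75, ..., 1075

y0 = N / (sigma * sqrt(2 * pi))   # maximum ordinate


def ordinate(m):
    """Ordinate of the fitted normal curve at a point m on the message scale."""
    x = (m - mean) / width        # deviation from the mean, in class intervals
    return y0 * exp(-(x / sigma) ** 2 / 2)


computed = {m: ordinate(m) for m in midpoints}
for m, y in computed.items():
    print(m, round(y, 2))
print(round(sum(computed.values()), 2))
```

Values found in this way may differ slightly from those obtained with a printed table, where $x/\sigma$ is rounded to two decimal places before the table is entered.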




TABLE 126


Computation of Ordinates of Normal Curve Fitted to Frequency Distribution
of Telephone Subscribers

 Number of    Deviation from M                          Computed frequency
 messages    (in class intervals)   $x/\sigma$     $y/y_0$        (ordinates)
     m                x                                        $y_c$

    25            -9.0392          -3.06     .00926          1.24
    75            -8.0392          -2.72     .02474          3.33
   125            -7.0392          -2.38     .05888          7.90
   175            -6.0392          -2.05     .12230         16.44
   225            -5.0392          -1.71     .23176         31.15
   275            -4.0392          -1.37     .39123         52.59
   325            -3.0392          -1.03     .58834         79.08
   375            -2.0392           -.69     .78817        105.95
   425            -1.0392           -.35     .94055        126.43
   475             -.0392           -.01     .99995        134.41
   525              .9608            .32     .94856        [illegible]
   575             1.9608            .66     .80429        108.11
   625             2.9608           1.00     .60653         81.53
   675             3.9608           1.34     .40747         54.77
   725             4.9608           1.68     .24385         32.78
   775             5.9608           2.02     .13000         17.47
   825             6.9608           2.36     .06174          8.30
   875             7.9608           2.70     .02612          3.51
   925             8.9608           3.03     .01015          1.36
   975             9.9608           3.37     .00342           .46
  1025            10.9608           3.71     .00103           .14
  1075            11.9608           4.05     .00027           .04

  Total                                                    994.71


TESTING THE GOODNESS OF FIT 


The actual distribution and the normal curve, as fitted 
by the above computations, are plotted in Fig. 86. It is 
apparent, by inspection, that the normal curve gives a 
fairly good fit to the data, although there are several 
points at which the differences are marked. A natural
question arises as to the reason for the failure of the normal 
curve to fit at all points. There are two possible answers 




to such a question. The failure to fit may be due merely 
to chance fluctuations such as are found in any sample. 
We may have an underlying law of distribution of residence 
subscribers, classified by message use, which accords per- 
fectly with the normal law of error, but the particular 
sample selected may be marked by certain irregularities 


[Figure: actual frequencies and fitted normal curve, plotted against Number of Messages (horizontal axis, 0 to 1000).]

Fig. 86. Illustrating the Fitting of a Normal Curve to Frequency Distribution
of Telephone Subscribers, Classified According to Message Use


which would be ironed out if a very large number of cases 
were included. On the other hand, the differences may be 
due to a fundamental failure of such a distribution to 
accord with the normal law of error. Such a law may not 
describe the distribution of telephone calls, in which case 
the normal curve should not be employed. 

Which of these explanations accounts for the differences 
noted? We may attempt to answer this question in two
ways. We may consider the differences between actual 
and computed frequencies at several points where these 




differences are greatest, and determine whether these 
differences are such as might be due to the fluctuations 
arising in random sampling; or we may seek to determine 
whether, for the curve as a whole, the combined discrep- 
ancies between observed and computed frequencies are such 
as might arise in random sampling, if the underlying 
distribution of telephone calls actually accords with the 
normal law of error. Both these methods may be employed.

It is necessary, however, to compute more exactly the 
theoretical frequencies. The ordinates of the normal curve 
do not measure these frequencies with sufficient accuracy. 
The ordinate at the middle of an interval one unit wide 
would accurately measure the area within that interval 
only if the curve which marked its upper boundary were 
straight. This is not true with respect to the segments 
under the normal curve. Accordingly, it is necessary to 
determine the theoretical frequencies by measuring areas 
instead of ordinates. 


THE DETERMINATION OF THEORETICAL FREQUENCIES


The entire area under a given frequency curve is taken 
to represent the total number of frequencies. Given 
information as to the proportion of the total area within a 
given segment, it would be easy to compute the frequencies 
represented by this segment. Fortunately such informa- 
tion is to be had, with respect to the normal curve, from 
prepared tables of the probability integral, of which
Table 127 is an example. A much more detailed
table is needed for accurate computation.¹

Since the normal curve is symmetrical about the maxi- 
mum ordinate, the values given below apply both to positive 
and negative values, as measured from the mean. 


1 Detailed tables of the probability integral as calculated by Dr. W. F. Sheppard
have been widely circulated. Pearson’s Tables for Statisticians and Biometricians 
contains such a table in excellent form for use. 




TABLE 127
Area of the Normal Curve in Terms of Abscissa

(Giving fractional parts of the total area between $y_0$ and ordinates erected
at varying distances from $y_0$)

 $x/\sigma$     Area        $x/\sigma$     Area
  0.0    .00000       2.0    .47725
  0.1    .03983       2.1    .48214
  0.2    .07926       2.2    .48610
  0.3    .11791       2.3    .48928
  0.4    .15542       2.4    .49180
  0.5    .19146       2.5    .49379
  0.6    .22575       2.6    .49534
  0.7    .25804       2.7    .49653
  0.8    .28814       2.8    .49744
  0.9    .31594       2.9    .49813
  1.0    .34134       3.0    .49865
  1.1    .36433       3.1    .49903
  1.2    .38493       3.2    .49931
  1.3    .40320       3.3    .49952
  1.4    .41924       3.4    .49966
  1.5    .43319       3.5    .49977
  1.6    .44520       3.6    .49984
  1.7    .45543       3.7    .49989
  1.8    .46407       3.8    .49993
  1.9    .47128       3.9    .49995
                      4.0    .49997


In using such a table, deviations from the mean are 
first expressed in terms of the standard deviation. The 
proportion of the total area lying between any two ordinates 
may then be readily determined. For example: What pro- 
portion of the cases in a normal distribution lies between 
the maximum ordinate and an ordinate erected at a distance
from the mean equal to $+\sigma$? Reading down the $x/\sigma$
column to 1.0, we find the value .34134 opposite it. This, 
in ratio form, is the proportion of cases falling within the 
limits indicated. Expressing this ratio as a percentage, we 
have 34.134 per cent as the answer to our question. 

Fig. 87 shows the relation of this area (the shaded area A) 
to the total area under the curve. 




What proportion of the total number of cases in a normal
frequency distribution will fall between an ordinate erected
at a distance from the mean equal to $-1.4\sigma$ and one
erected at $-2\sigma$? From the table we find that 41.92 per
cent of the total area will lie between $y_0$ and the ordinate
at $-1.4\sigma$; 47.73 per cent will lie between $y_0$ and the
ordinate at $-2\sigma$. The difference, 5.81 per cent, will fall
between the ordinates at $-1.4\sigma$ and at $-2\sigma$. This may
be converted into actual frequencies by taking this proportion
of the total number of cases in the given distribution.
The shaded segment B in Fig. 87 represents
the area thus marked off.

[Figure: normal curve with the horizontal axis marked from $-4\sigma$ to $+4\sigma$, showing the shaded areas A and B.]

Fig. 87. An Illustration of the Measurement of Areas Under the Normal Curve
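The two readings just made from the table of areas may be reproduced with the error function of the Python standard library. This is a sketch only; the function names `area_from_mean` and `area_between` are assumptions, and the areas are computed directly rather than read from a table.

```python
from math import erf, sqrt


def area_from_mean(z):
    """Fraction of the total area between the maximum ordinate and the ordinate at z = x/sigma."""
    return 0.5 * erf(abs(z) / sqrt(2))


def area_between(z1, z2):
    """Fraction of the total area between ordinates at z1 and z2 (signed, in units of sigma)."""
    def signed(z):
        return area_from_mean(z) if z >= 0 else -area_from_mean(z)
    return abs(signed(z2) - signed(z1))


print(round(area_from_mean(1.0), 5))          # area A of Fig. 87
print(round(area_between(-1.4, -2.0), 4))     # area B of Fig. 87
```

The second figure may differ from 5.81 per cent in the last place, because the tabular entries were rounded before being subtracted.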

The procedure to employ in computing theoretical fre- 
quencies from such a table of areas will be clear from these 
examples. It is necessary only to find the area between 
the maximum ordinate and ordinates erected at the various 
class limits. By the simple process of subtraction the 
area within each class, and hence the theoretical frequencies, 
may be computed. The procedure is illustrated in the fol- 
lowing table, relating to the distribution of telephone sub- 
scribers. 
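The procedure of subtraction may be sketched as below. The constants are those already found for the telephone data; the function name `cumulative` is an assumption, and the areas are again obtained from the error function instead of a printed table, so the results agree with the tabulated frequencies only to within about one case per class.

```python
from math import erf, sqrt

N = 995
mean = 476.96
sigma = 2.953 * 50                     # standard deviation converted to messages
limits = list(range(0, 1101, 50))      # class limits 0, 50, ..., 1100


def cumulative(x):
    """Proportion of a normal distribution lying below the point x."""
    return 0.5 * (1 + erf((x - mean) / (sigma * sqrt(2))))


# Area within each class, by subtraction of the areas up to its two limits.
freqs = [N * (cumulative(hi) - cumulative(lo)) for lo, hi in zip(limits, limits[1:])]
freqs[0] += N * cumulative(limits[0])            # fold the lower tail into the first class
freqs[-1] += N * (1 - cumulative(limits[-1]))    # fold the upper tail into the last class

for lo, f in zip(limits, freqs):
    print(f"{lo}-{lo + 50}", round(f, 2))
print(round(sum(freqs), 2))
```

Folding the two tails into the end classes makes the theoretical frequencies sum to the number of cases.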




TABLE 128


Illustrating the Computation of Theoretical Frequencies from a
Table of Areas

(1) Class limit
(2) Deviation from mean in units of $\sigma$ ($x/\sigma$)
(3) Proportion of area between $y_0$ and ordinate at $x/\sigma$
(4) Number of cases between $y_0$ and ordinate at $x/\sigma$
(5) Theoretical frequencies, by classes

  (1)       (2)        (3)         (4)              (5)

     0     -3.23    .4993810     496.88
    50     -2.89    .4980738     495.58        0-50          1.92¹
   100     -2.55    .4946139     492.14       50-100         3.44
   150     -2.22    .4867906     484.36      100-150         7.78
   200     -1.88    .4699460     467.60      150-200        16.76
   250     -1.54    .4382198     436.03      200-250        31.57
   300     -1.20    .3849303     383.01      250-300        53.02
   350      -.86    .3051055     303.58      300-350        79.43
   400      -.52    .1984682     197.48      350-400       106.10
   450      -.18    .0714237      71.07      400-450       126.41
   500      +.16    .0635595      63.24      450-500       134.31
   550      +.495   .1896931     188.74      500-550       125.50
   600      +.83    .2967306     295.25      550-600       106.51
   650     +1.17    .3789995     377.10      600-650        81.85
   700     +1.51    .4344783     432.31      650-700        55.21
   750     +1.85    .4678432     465.50      700-750        33.19
   800     +2.19    .4857379     483.31      750-800        17.81
   850     +2.53    .4942969     491.83      800-850         8.52
   900     +2.87    .4979476     495.46      850-900         3.63
   950     +3.20    .4993129     496.82      900-950         1.36
  1000     +3.54    .4997999     497.30      950-1000         .48
  1050     +3.88    .4999478     497.45     1000-1050         .15
  1100     +4.22    .4999878     497.49     greater than
                                             1050             .05

  Total                                                    995.00


1 The theoretical distribution shows .62 of a case below $-3.23\sigma$. Since this
would be meaningless in the present example, this amount is added to the theo- 
retical frequency between 0 and 50. 


THE STANDARD ERROR OF SAMPLING AS A TEST OF
GOODNESS OF FIT
Now that we have this more accurate compilation of 
theoretical frequencies we are prepared to test the results 




of the fitting process to ascertain whether the normal 
curve gives a reasonable fit. We may first take the greatest 
differences between the observed and the theoretical frequencies,
and determine the significance of such discrepancies.
As we have seen, the standard deviation may be
computed, when the probabilities are known, from the
formula

$$\sigma = \sqrt{npq}$$

where p refers to the probability of success and q to the
probability of failure. In dealing with a frequency distribution
we may replace p by $\tfrac{f}{N}$ (f representing the theoretical
frequency at a selected point on the x-scale and
N the total number of cases). In this case q will be equal
to $\tfrac{N - f}{N}$. Inserting these measures in place of p and
q in the general formula for the standard deviation, with n equal to N, we
have

$$\sigma_f = \sqrt{\frac{f(N - f)}{N}}$$

This measure is called the standard error of sampling.

The points on the scale, at which the differences between
the observed and theoretical frequencies are the greatest,
are the following:

(Theoretical frequencies are represented by f and observed
frequencies by $f_o$.)

    m      $f_o$       f       $f_o - f$
   325      95     78.63     +16.37
   375      85    106.90     -21.90
   525     144    125.50     +18.50
   775      11     17.81      -6.81

For the standard error of sampling in the first case we have

$$\sigma_f = \sqrt{\frac{78.63(995 - 78.63)}{995}} = 8.51$$




Computing similar measures for the three other classes, we 
secure the following values: 


$\sigma_f$ for class with midpoint 325 = 8.51
$\sigma_f$ for class with midpoint 375 = 9.77
$\sigma_f$ for class with midpoint 525 = 10.47
$\sigma_f$ for class with midpoint 775 = 4.18


We may now answer in a definite way the question as 
to whether the discrepancies between observed and theo- 
retical frequencies would be possible if the data relating 
to telephone calls actually followed the normal law of error 
in its distribution. The difference in the case of the class 
with midpoint 325 is +16.37; $\sigma_f$ for this class is 8.51.
The deviation is +1.9 times $\sigma_f$. How often would such
a deviation occur in actual practice if the underlying distribution
were really normal?

We may answer this question by referring to the table
of the probability integral (Table 127). Opposite the
value $x/\sigma = 1.9$ we find .47128. This means that approximately
47 per cent of all the cases will lie between the
maximum ordinate and an ordinate at $1.9\sigma$ from the mean.
Approximately 94 per cent of the total area lies between
ordinates erected at $\pm 1.9\sigma$ from the mean, hence the
chances are about 6 out of 100 that a given value will
differ from the mean by more than $1.9\sigma$. This particular
deviation, then, is one which might be accounted for by
mere chance fluctuations of sampling, and does not prove
that the normal curve is unsuited to the present distribution.
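The test just applied to a single class may be sketched in Python for all four classes at once. The function names `sampling_error` and `chance_of_larger_deviation` are assumptions of this illustration; the frequencies are those listed above for the classes showing the greatest differences.

```python
from math import erfc, sqrt


def sampling_error(f, N):
    """Standard error of a theoretical class frequency f among N cases: sqrt(f (N - f) / N)."""
    return sqrt(f * (N - f) / N)


def chance_of_larger_deviation(ratio):
    """Two-sided normal probability of a deviation exceeding `ratio` standard errors."""
    return erfc(abs(ratio) / sqrt(2))


N = 995
classes = [(325, 95, 78.63), (375, 85, 106.90), (525, 144, 125.50), (775, 11, 17.81)]
for midpoint, observed, theoretical in classes:
    se = sampling_error(theoretical, N)
    ratio = (observed - theoretical) / se
    print(midpoint, round(se, 2), round(ratio, 2), round(chance_of_larger_deviation(ratio), 3))
```

The last column counts deviations in both directions, corresponding to the area outside ordinates erected at equal distances on either side of the mean.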

Similar tests may be applied to the differences found in 
the other classes. The greatest is that in the class with 
a mid-value of 375, in which the difference of -21.90 represents
$2.2\sigma_f$. Reference to the tables shows that a negative
deviation of this amount or more would occur about 1.5 
times out of 100. Counting both positive and negative 
deviations, such a difference between observed and theo- 




retical frequencies might be expected about 3 times out of 
each 100 trials. We may conclude, then, that even the 
greatest discrepancy actually found in the present example 
is not so large as to prove that the normal curve is in- 
applicable in describing the distribution of residence sub- 
scribers on the basis of message use. 


THE CHI-SQUARE TEST OF GOODNESS OF FIT

It is obvious that a somewhat more conclusive result 
would be secured if we took account of all the classes and 
all the differences, instead of confining ourselves to the 
most exceptional cases. Karl Pearson has evolved such a 
measure, $\chi^2$, which enables us to determine the goodness
of fit of a given type of frequency curve, when all the data 
are taken into account. Having the theoretical and ob- 
served frequencies, this measure is computed quite readily, 
as is demonstrated in the following example: 


TABLE 129
Computation of $\chi^2$ for Testing Goodness of Fit
Normal Curve of Error Fitted to Distribution of Telephone Subscribers

      (1)            (2)          (3)          (4)          (5)
     Class         Observed    Theoretical   $(f_o - f)$    $(f_o - f)^2 / f$
     limits        frequency    frequency
                     $f_o$           f

 150 and less         10         13.14        -3.14         .75
 150-200              19         16.76        +2.24         .30
 200-250              38         31.57        +6.43        1.31
 250-300              50         53.02        -3.02         .17
 300-350              95         78.63       +16.37        3.41
 350-400              85        106.90       -21.90        4.49
 400-450             115        126.41       -11.41        1.03
 450-500             132        134.31        -2.31         .04
 500-550             144        125.50       +18.50        2.72
 550-600             116        106.51        +9.49         .85
 600-650              79         81.85        -2.85         .10
 650-700              54         55.21        -1.21         .03
 700-750              31         33.19        -2.19         .14
 750-800              11         17.81        -6.81        2.60
 More than 800        16         14.19        +1.81         .23

 Total               995        995.00     15 Groups    $\chi^2$ = 18.18




It will be noted that in the construction of this table 
the three classes at the lower end of the distribution have 
been lumped into one, and that the same thing has been 
done with the six classes at the upper end of the distribution. 
This is done to avoid unduly magnifying slight differences 
between the tails of the observed and theoretical distri- 
butions. 
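The computation of Table 129 may be sketched in Python as follows. The two lists are columns (2) and (3) of that table, with the end classes already lumped as described; the function name `chi_square` is an assumption.

```python
# Observed and theoretical frequencies for the 15 groups of Table 129.
observed = [10, 19, 38, 50, 95, 85, 115, 132, 144, 116, 79, 54, 31, 11, 16]
theoretical = [13.14, 16.76, 31.57, 53.02, 78.63, 106.90, 126.41, 134.31,
               125.50, 106.51, 81.85, 55.21, 33.19, 17.81, 14.19]


def chi_square(observed, theoretical):
    """Sum over classes of (observed - theoretical)**2 / theoretical."""
    return sum((fo - f) ** 2 / f for fo, f in zip(observed, theoretical))


chi2 = chi_square(observed, theoretical)
print(len(observed), round(chi2, 2))
```

The value obtained without intermediate rounding may differ by about one unit in the second decimal place from the total of the rounded entries in column (5).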

The value of $\chi^2$ furnishes us with the required measure
of goodness of fit, though its interpretation necessitates the
use of detailed tables.¹ The following extract from such
a table will indicate the procedure:


Values of P for given values of $\chi^2$ and n'

  $\chi^2$      n' = 14      n' = 15      n' = 16
   16      .249129      .313374      .382051
   17      .199304      .256178      .318864
   18      .157520      .206781      .262666
   19      .123104      .164949      .213734
   20      .095210      .130141      .171932

In this table $\chi^2$ represents $\sum \frac{(f_o - f)^2}{f}$
and n' represents the number of classes. In the present
case

$$\chi^2 = 18.18, \qquad n' = 15$$

For n' = 15 and $\chi^2$ = 18, the value of P is .206781; for n'
= 15 and $\chi^2$ = 19, P = .164949. Interpolating for the
given value of $\chi^2$ we find that P = .199.
P measures the probability that, if the real distribution 
were in accord with the normal law of error, we should 
obtain a fit as bad or worse than that actually secured. 
That is, in 199 out of 1000 trials, or about 1 out of 5, we 
should secure a fit as bad or worse than the one we have, 
assuming the underlying distribution to be normal. These 
are not at all excessive odds, and we may conclude that the 


1 These tables, as worked out by W. Palin Elderton, are to be found in Karl 
Pearson’s Tables for Statisticians and Biometricians, 26-30. 




normal curve gives such a reasonably good fit that it may 
be employed as a means of describing the given distribution. 
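The tabled probabilities quoted above may be checked without recourse to the printed tables. The sketch below is an illustration only, with function names of our own choosing. It assumes the usual convention that $n'$ classes give $n' - 1$ degrees of freedom, and it uses the closed form of the chi-square tail that holds when the degrees of freedom are even, as they are here (14).

```python
import math


def chi_square_tail(chi2, n_classes):
    """P(chi-square >= chi2) with n_classes - 1 degrees of freedom.

    Uses the closed form valid only when the degrees of freedom are even:
    exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!
    """
    dof = n_classes - 1
    if dof % 2 != 0:
        raise ValueError("closed form used here needs even degrees of freedom")
    half = chi2 / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(dof // 2))


def interpolate(x, x0, y0, x1, y1):
    """Straight-line interpolation between two tabled points."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


p_18 = chi_square_tail(18, 15)     # tabled as .206781
p_19 = chi_square_tail(19, 15)     # tabled as .164949
p_interp = interpolate(18.18, 18, p_18, 19, p_19)
p_direct = chi_square_tail(18.18, 15)
print(round(p_18, 6), round(p_19, 6), round(p_interp, 3), round(p_direct, 3))
```

Straight-line interpolation and direct evaluation differ only in the fourth decimal place here, so the interpolated figure of about .199 is quite adequate for judging the fit.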

Having reached this conclusion, all the information which 
is had with respect to distributions following the normal 
law of error is immediately available in the study of the 
given distribution. In using the original frequency table, 
we are limited to the classes there established. We may 
now go beyond this and determine how many cases may 
be expected between any two values on the x-scale. We
may compute the probability of a case falling anywhere on 
the scale, or above or below any given value. And not 
only is our knowledge of details greater. In so far as we 
are assured of the representative character of our sample 
we have a basis for generalization which would be afforded 
by no amount of study of the particular distribution as a 
thing apart. 


NOTE ON THE DESCRIPTION OF THE FREQUENCY
DISTRIBUTION
With the aid of the criteria explained in this chapter it is 
possible to describe a frequency distribution somewhat more 
accurately, in some respects, than could be done with the measures 
employed in the earlier chapters. A treatment of this subject is 
beyond the scope of the present book, but it seems advisable to 
indicate briefly the nature of these additional measures. 
The value of $\beta_2$ serves as a measure of the degree of “flat-toppedness”
found in a given curve. If $\beta_2 = 3$, as in the normal
type, the curve is said to be mesokurtic. If $\beta_2 < 3$ the curve is
platykurtic, or flatter than the normal type. If $\beta_2 > 3$, as in the
example given above, the curve is leptokurtic, or more peaked
than the normal.


1 As was stated, the normal curve is but one type of frequency curve, though 
one of basic importance. A comprehensive system of frequency curves is that 
associated with the name of Karl Pearson, who has derived equations to and has 
described in detail a number of standard types. An account of other fundamental 
types will be found in the books by Arne Fisher referred to at the end of this 
chapter. A treatment of this subject in detail is beyond the scope of the present
work.




A measure of skewness which is more accurate than those 
given early in the book may also be computed from these measures. 
Karl Pearson has shown that the quantity

$$\chi = \frac{\sqrt{\beta_1}\,(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}$$

serves as a measure of the degree of asymmetry of a given curve.¹
Inserting the values of $\beta_1$ and $\beta_2$ given above we have, in the
case of the distribution based on message use,

$$\chi = -.05558$$

($\chi$ is positive if the mean is greater than the median, negative
if the mean is less than the median. In the present case the value
of the mean is 476.96, that of the median is 482.39, hence the
skewness is negative.)


Finally, the distance, d, between the mean and the mode may
be determined from the relation

$$d = \chi\sigma$$

In the distribution described above (relating to telephone use)
$\sigma$, in original units, equals 147.65. Hence

$$d = -.05558 \times 147.65 = -8.21$$

Since

$$M - M_o = d$$

where M is the mean and $M_o$ the mode, we have

$$M_o = 476.96 + 8.21 = 485.17$$

This gives a truer approximation to the modal value than any
of the methods discussed in Chapter IV.
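The steps just described may be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a prescribed procedure: the function name `pearson_skewness` is our own, it returns the magnitude only (the sign being settled by comparing mean and median, as explained above), and d is taken as mean minus mode, which is the reading consistent with the modal value of 485.17.

```python
import math


def pearson_skewness(beta1, beta2):
    """Pearson's skewness: sqrt(beta1)(beta2 + 3) / (2(5*beta2 - 6*beta1 - 9)).

    Returns the magnitude only; the sign is fixed separately by whether the
    mean lies above or below the median.
    """
    return math.sqrt(beta1) * (beta2 + 3) / (2 * (5 * beta2 - 6 * beta1 - 9))


# Values quoted in the text for the telephone-use distribution.
chi = -0.05558      # skewness (negative: mean below median)
sigma = 147.65      # standard deviation in original units
mean = 476.96

d = chi * sigma     # d = mean - mode
mode = mean - d
print(round(d, 2), round(mode, 2))
```

For a normal curve ($\beta_1 = 0$, $\beta_2 = 3$) the function returns zero, as it should.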


1 $\chi$, the measure of skewness, is quite distinct from $\chi^2$, which is computed in
testing goodness of fit. The two measures should not be confused.


REFERENCES


Bowley, Arthur L. Elements of Statistics (259-286).
Brunt, David. The Combination of Observations (11-28).
Carver, H. C. Frequency Curves (In Rietz, H. L., Handbook of
Mathematical Statistics, 92-119).
Elderton, W. P. Frequency Curves and Correlation.
Fisher, Arne. An Elementary Treatise on Frequency Curves.
The Mathematical Theory of Probabilities.
Kelley, Truman L. Statistical Method (94-108). (The Kelley-Wood
Table of the Normal Probability Integral is given
as an appendix.)
Pearl, Raymond. Medical Biometry and Statistics (220-263).
Pearson, Karl. Tables for Statisticians and Biometricians.
(The Introduction to these tables will be found particularly
useful.)
Sheppard, W. F. On the Calculation of the Most Probable Values
of Frequency Constants for Data Arranged According to
Equi-distant Divisions of a Scale. Proceedings of the London
Mathematical Society, Vol. XXIX, 1898.
The Calculation of the Moments of a Frequency Distribution.
Biometrika, Vol. V.
Yule, G. U. An Introduction to the Theory of Statistics (291-316).


CHAPTER XVI 


STATISTICAL INDUCTION AND THE PROBLEM 
OF SAMPLING 


The preceding pages have been devoted to an account 
of the tools which are employed in statistical analysis. 
Examples illustrating the application of these tools to 
certain specific problems have been presented, but the 
emphasis throughout has been on technique. It is ap- 
propriate at this point that we stand off a distance, en- 
larging our perspective, and consider certain general prob- 
lems relating to the application of these tools. What is 
their proper place in economic and business research? 
What are the assumptions involved in using them and what 
are their limitations? These are questions which cannot 
be ignored, though detailed treatment is not possible 
within the limits of the present volume. 


STATISTICAL DESCRIPTION AND STATISTICAL 
INDUCTION 


In approaching this subject we must first make clear 
the distinction between statistical description and statistical 
induction. By employing the methods of statistics it is 
possible, as we have seen, to describe succinctly a mass of 
quantitative data. Hundreds or thousands of individual 
cases may be classified, and a frequency distribution 
formed. The essence of this distribution may be boiled 
down to perhaps four measures — of central tendency, of 
variation, of skewness and of kurtosis. A tremendous 
gain has been realized in thus replacing the multiplicity
of individual cases by a limited number of measures which
describe the characteristics of the group as a whole. The
possession of such tools makes it possible for our limited 
powers of perception to grasp the significance of facts in 
the mass. Again, the methods of statistics enable us to 
describe relations between variable quantities. By securing 
the equation to an appropriate curve fitted to the data by 
mathematical methods, we are enabled to determine how 
much, on the average, one quantity changes in value as 
one or more related quantities vary. This may be supple- 
mented by a measure of the scatter or dispersion about the 
fitted curve, and by a final measure, in abstract terms, of 
the degree of correlation between the dependent and the 
one or more independent variables. 

In so far as the results are confined to the cases actually 
studied, these various statistical measures are merely devices 
for describing certain features of a distribution, or certain 
relationships. Within these limits the measures may be 
used with perfect confidence, as accurate descriptions of 
the given characteristics. But when we seek to extend 
these results, to generalize the conclusions, to apply them 
to cases not included in the original study, a quite new 
set of problems is faced. 

The logical process by which general conclusions are 
drawn from a study of particular cases is termed induction, 
as opposed to deduction, which involves the drawing of 
conclusions from certain general principles. By statistical
induction or statistical inference is meant the generalization 
of statistical results, on the assumption that a given 
statistical measure may be taken to apply to a larger 
group than that from which it was actually derived. We 
are employing this procedure constantly in practical statis- 
tical work, though not always with a full realization of the 
assumptions inherent in that process and of the limita- 
tions which attach to it. These assumptions and limita- 
tions may be briefly considered. 




THE GENERALIZATION OF STATISTICAL RESULTS


The problem at issue in considering the validity of 
statistical induction may be put in the following form: 
A statistical measure — an average, a frequency ratio, a 
coefficient of correlation — has been derived from the study 
of certain data drawn from a large population. (The 
term “population” refers to a group of things or phenomena 
having, presumably, certain characteristics in common.) 
May we assume that, if additional samples were taken 
from the same population, the corresponding measures 
would have the same values? If not, may we determine, 
from the results secured in the given case, the approximate 
limits to the fluctuations to be expected in these measures 
as derived from successive samples? Here, obviously, is 
a problem of supreme importance. Karl Pearson has 
called it “the fundamental problem of practical statistics.”
If we cannot be assured of a certain degree of stability in 
the results secured from successive samples it would be 
quite invalid to generalize from the study of a limited 
number of cases. No weight would attach to any study 
except one which covered the entire universe of things or 
phenomena composing the given population. Yet such all- 
inclusive studies of economic phenomena are practically 
impossible. Index numbers of prices, of wages, of living 
costs, equations describing the relation between the pro- 
duction and price of given commodities, coefficients of 
correlation between temperature and crop yield — all must 
of necessity be based on the study of samples. 


THE ASSUMPTION OF UNIFORMITY IN NATURE


The statistical measures secured from successive samples 
might be assumed to be stable if the validity of two prior 
assumptions be granted. The first of these is, in general 
terms, an assumption of uniformity in nature, the assumption
that there is, in nature, a “limitation to the amount
of independent variety.” When dealing with quantitative
data this uniformity in nature appears in the stability of 
large numbers, as exemplified by the curious regularity 
in such phenomena as birth rates or death rates. Nature, 
in other words, is not marked by utter chaos; principles 
of regularity, order and stability appear in all natural 
processes, and these principles are strongly evident when 
we deal with masses of quantitative data. Therefore, when 
we generalize such a measure as an index number of whole- 
sale prices, we do so on some such assumption as this: 
It is reasonable to suppose that, in the larger population 
to which this result is to be applied, there exists a uni- 
formity with respect to the characteristic or relation we 
have measured. As a result of this uniformity, we should 
expect statistical measures derived from successive samples 
drawn from this population to fluctuate within definite and 
assignable limits, which we may approximate in advance. 

It is evident that in making this assumption, in saying 
“It is reasonable to suppose . . . ,” we are introducing
an hypothesis which is incapable of complete verification
by purely statistical methods. There is, thus, in every
statistical induction, an a priori element. The statistical
conclusion can never stand completely on its own feet. 
It must be endorsed by reason and judgment if it is to 
carry conviction. If a high positive coefficient of correla- 
tion were secured from the study of a sample relating to 
banana importations and the suicide rate, this would not 
furnish convincing evidence of a causal relation, or a 
relation of contingency, between these two variables. 
There would be no reasonable basis for assuming that, in 
the larger universe of phenomena from which the sample 
was drawn, there would be uniformity with respect to 
this relationship. 




THE NECESSITY OF A REPRESENTATIVE SAMPLE


The second condition upon which the validity of this 
type of reasoning depends is implicit in the preceding 
argument. This is the assumption that the sample from
which our first results were derived is thoroughly repre- 
sentative of the entire population to which the results are 
to be applied. The securing of such a representative 
sample is a first condition of valid statistical induction. 

How may a representative sample be secured? It is 
generally laid down as essential that the members included 
in the sample shall be random members of the population 
at large, that in the selection of the sample there shall be 
present no element of preference or bias which would tend 
toward the inclusion or exclusion of certain members of 
the larger group. Each member of the population should 
have an equal chance of inclusion in the sample on which 
the induction is to be based. This general requirement, 
as J. M. Keynes has pointed out, should be interpreted 
to mean that with respect to the generalization in question 
the members of the sample should be random members of 
the population at large. Great care is generally needed in 
securing a purely random selection. The obvious procedure 
of picking the most readily available cases would by no 
means meet the condition of random selection. Certain 
important elements in the universe of facts to which the 
conclusions are to be applied may be excluded through the 
play of an unconscious bias unless careful attention is 
given to the selection of cases. 
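The requirement that each member of the population have an equal chance of inclusion may be illustrated with a brief sketch in Python. The population of 100,000 record numbers and the sample of 900 are hypothetical figures chosen for illustration, and the variable names are our own.

```python
import random

# Hypothetical population: 100,000 record identifiers.
population = range(100_000)

rng = random.Random(1924)                 # fixed seed so the draw can be repeated
sample = rng.sample(population, k=900)    # every member equally likely to be drawn

# The "most readily available" cases (say, the first 900 on file)
# do not constitute a random selection.
convenience = list(population[:900])
print(len(sample), len(set(sample)), max(convenience))
```

The random draw ranges over the whole file, whereas the convenience selection never reaches beyond record 899; any characteristic associated with position in the file would be wholly misrepresented by the latter.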


CONDITIONS OF SIMPLE SAMPLING 


G. U. Yule¹ has named certain other conditions which
are tacitly assumed in deducing formulas relating to the
sampling process. The assumptions upon which the process
of “simple sampling” rests are:


1 Introduction to the Theory of Statistics, 5th ed., 259-261. 




1. An assumption of complete independence between the 
several events composing the sample. Thus to the extent that 
a change in the price of one commodity is affected by a change 
in the price of another, the two price changes are not independent, 
and their inclusion in a single sample violates the conditions of 
simple sampling. 

2. An assumption that there is no essential difference between 
the localities from which the members of the sample are taken, 
and that there has been no essential change in underlying con- 
ditions during the period of time covered by the observations. 

3. An assumption that “the conditions that regulate the 
appearance of the character observed (are) not only the same 
for every sample, but for every individual in every sample.” 


When all these conditions of sampling have been ob- 
served, it is possible to assign in advance limits within 
which we may expect statistical measures derived from 
different samples of the same population to fluctuate. 
This means that we may apply to the population at large 
statistical measures secured from the study of a sample 
not with confidence in their perfect stability, but with 
fairly definite knowledge of the margin of error involved 
in thus extending our results. Where the necessary con- 
ditions are fulfilled statistical induction is a valid pro- 
cedure. 


THE SIGNIFICANCE OF MEASURES OF RELIABILITY


The nature of the measures of stability which we seek 
should be clearly understood. When we generalize from 
a knowledge of certain statistical quantities we do so by 
applying to a larger group the value for mean, standard 
deviation, or coefficient of correlation which has been 
computed from a sample. And we wish to determine the 
limits within which these results would fluctuate if com- 
puted from a number of different samples drawn from the 
same population. A measure of these limits will serve as 
a measure of the reliability of the given results, when 
extended. 




Such a measure might be secured by the laborious process 
of studying a great many different samples, just as the 
dice were thrown 4096 times. Thus we might desire to 
test the reliability of an average of weekly earnings of a 
certain class of workers. A first average might be secured 
from a sample composed of 250 individual records. This
result might be tested by computing averages from 499 additional samples,
each based on 250 individual records. These 500 averages
would not be identical in value, but if they were tabulated 
a frequency distribution closely approximating the normal 
type would be secured. From this distribution we might 
compute the mean of all the averages and the standard 
deviation of these averages. This standard deviation 
would serve as a measure of the variation found in the 
average of weekly earnings, as computed from successive 
samples. 
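The laborious process just described is easily imitated by machine. The following Python sketch is purely illustrative: the population of weekly earnings is hypothetical (generated with a mean of $27.50 and a standard deviation of $2.00), and the variable names are our own.

```python
import random
import statistics

rng = random.Random(0)                    # fixed seed, for repeatability

# Hypothetical population of weekly earnings (illustrative figures only).
population = [rng.gauss(27.50, 2.00) for _ in range(20_000)]

# 500 samples of 250 records each; one average per sample.
averages = [statistics.fmean(rng.sample(population, 250)) for _ in range(500)]

mean_of_averages = statistics.fmean(averages)
sd_of_averages = statistics.pstdev(averages)
print(round(mean_of_averages, 2), round(sd_of_averages, 3))
```

The standard deviation of the 500 averages is the empirical measure of reliability referred to in the text; it is much smaller than the standard deviation of the individual earnings.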

But it is generally impracticable to take 400 or 500 
successive samples in order to test the reliability of a given 
measure. When the conditions of simple sampling, as 
set forth above, are fulfilled, it is possible to compute such 
measures of reliability more directly. 

In the preceding chapter the results of an experiment 
involving the throwing of 12 dice 4096 times were given. 
A throw of 4, 5 or 6 was counted a success. The standard 
deviation, as computed from the results, was 1.712. But 
we found it possible, knowing the probabilities involved 
and the number of dice thrown each time, to compute a 
theoretical standard deviation from the formula 


$$\sigma = \sqrt{npq}$$


This had a value of 1.732. 
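The theoretical value may be verified, and the experiment imitated, with a short Python sketch. The simulation is an assumption of ours and does not reproduce the original throws; it merely repeats the same kind of experiment with a pseudo-random generator.

```python
import math
import random

n, p = 12, 0.5                 # 12 dice; a throw of 4, 5 or 6 counts as a success
q = 1 - p
sigma_theory = math.sqrt(n * p * q)

# Repeat the experiment by simulation: 4096 throws of 12 dice.
rng = random.Random(12)
successes = [sum(rng.randint(1, 6) >= 4 for _ in range(n)) for _ in range(4096)]
mean = sum(successes) / len(successes)
sigma_observed = math.sqrt(sum((s - mean) ** 2 for s in successes) / len(successes))
print(round(sigma_theory, 3), round(sigma_observed, 3))
```

As with the actual dice, the observed standard deviation should lie close to, though not exactly at, the theoretical figure.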

By an analogous procedure it is possible to compute 
similar quantities which serve as measures of the stability 
of statistical results, in so far as the fluctuations due to 
sampling are concerned. The formulas for these various 
measures of reliability, or standard errors, may be given
without attempting to indicate the processes by which
they have been derived. The references at the end of this 
chapter may be consulted for the details of these processes. 


STANDARD ERRORS OF THE CHIEF STATISTICAL
MEASURES


The reliability of an arithmetic mean, computed from a 
given sample, depends upon the value of the standard 
deviation of the original data and upon the number of 
cases included in the sample. The definite relationship is 


$$\sigma_{\text{mean}} = \frac{\sigma}{\sqrt{N}}$$

($\sigma$, without subscript, is taken to refer to the standard
deviation of the sample). This statement expresses the
reliability of a given result in terms of the standard error.
The probable error is more commonly used. The probable
error of a normal distribution, it will be recalled, is always
$.6745\sigma$. Thus we have

$$\text{P.E.}_{\text{mean}} = .6745\,\frac{\sigma}{\sqrt{N}}$$

The meaning of the expression may be made clear by the 
use of an example. Assume that the mean weekly wage 
of 100,000 industrial workers in a given city is desired. 
A study including all workers is impossible, and a repre- 
sentative sample of 900 is selected. The mean weekly wage 
is found to be $27.50. May this be accepted as accurate? 
By how much would the results secured from additional 
samples be expected to differ from this value? Assuming 
that the standard deviation of the data is found to be 


$2.00, we have 


$$\text{P.E.}_{\text{mean}} = .6745\,\frac{\$2.00}{\sqrt{900}} = \$.045$$


In reporting the result of the original study the mean
weekly wage would be given as $27.50 ± .045. This is
interpreted to mean that if an additional sample of the
same size were taken the chances would be even that the
result would not differ from $27.50 by more than $.045.
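The computation in the wage example may be set out as a brief Python sketch; the function names are our own and are offered only as an illustration.

```python
import math


def standard_error_of_mean(sigma, n):
    """Standard error of the mean: sigma / sqrt(N)."""
    return sigma / math.sqrt(n)


def probable_error(standard_error):
    """Probable error = .6745 times the standard error."""
    return 0.6745 * standard_error


se = standard_error_of_mean(2.00, 900)   # sigma = $2.00, N = 900
pe = probable_error(se)
print(round(se, 4), round(pe, 3))
```

Since the standard error varies inversely with the square root of N, the sample must be quadrupled to halve the probable error.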

In the same way measures of the accuracy of other 
statistical measures may be secured. The following are 
the formulas used in deriving the most important of these. 
(In each case the probable error may be derived by multi- 
plying the standard error by .6745.) 

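$$\sigma_{md} = 1.25331\,\frac{\sigma}{\sqrt{N}} \qquad \text{P.E.}_{md} = .84535\,\frac{\sigma}{\sqrt{N}}$$

$$\sigma_{Q_1} = 1.36263\,\frac{\sigma}{\sqrt{N}} \qquad \text{P.E.}_{Q_1} = .91908\,\frac{\sigma}{\sqrt{N}}$$

(These measures have the same values for $Q_3$ as for $Q_1$.)

$$\sigma_\sigma = \frac{\sigma}{\sqrt{2N}}$$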


This formula for the standard error of the standard 
deviation holds exactly only when the distribution of 
observations in the sample is normal. For a distribution 
of any type the reliability of o may be determined from 



$$\sigma_\sigma = \sqrt{\frac{\mu_4 - \mu_2^2}{4\mu_2 N}}$$

The reliability of the coefficient of correlation, for a 
normal distribution, is measured by 

$$\sigma_r = \frac{1 - r^2}{\sqrt{N}}$$

This formula holds also for the coefficient of multiple 
correlation, coefficients of partial correlation and, approxi- 
mately, for the correlation ratio. 
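The standard errors of the median, the quartiles, the standard deviation and the coefficient of correlation may be gathered into a few small functions. The sketch below is illustrative only: the function names and the sample figures ($\sigma$ = 2.00, N = 900, r = .60) are hypothetical, the median constant is taken as 1.25331 (that is, $\sqrt{\pi/2}$), and the formulas for $\sigma$ and r are those holding for a normal distribution.

```python
import math


def se_median(sigma, n):
    """Standard error of the median."""
    return 1.25331 * sigma / math.sqrt(n)


def se_quartile(sigma, n):
    """Standard error of the first (or third) quartile."""
    return 1.36263 * sigma / math.sqrt(n)


def se_std_dev(sigma, n):
    """Standard error of the standard deviation, normal distribution."""
    return sigma / math.sqrt(2 * n)


def se_correlation(r, n):
    """Standard error of the coefficient of correlation, normal distribution."""
    return (1 - r ** 2) / math.sqrt(n)


# Hypothetical sample: sigma = 2.00, N = 900, r = .60
sigma, n, r = 2.00, 900, 0.60
for name, value in [("median", se_median(sigma, n)),
                    ("quartile", se_quartile(sigma, n)),
                    ("std dev", se_std_dev(sigma, n)),
                    ("r", se_correlation(r, n))]:
    print(name, round(value, 4), "P.E.", round(0.6745 * value, 4))
```

It will be observed that the median and the quartiles are less reliable than the mean, whose standard error for the same sample would be $\sigma/\sqrt{N}$.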
The standard error of a coefficient of regression, assuming
the distribution to be normal, is determined from the
formula

$$\sigma_{b_{12}} = \frac{\sigma_1 \sqrt{1 - r^2}}{\sigma_2 \sqrt{N}}$$

This form may be adapted to a coefficient of regression
of any order.
As a test for linearity we have been given

$$\zeta = \eta^2 - r^2$$

But we wish to know whether, in a given case, the difference
between $\eta^2$ and $r^2$ is due merely to a chance fluctuation of
sampling, or is due to a real departure of the underlying
relationship from the linear form. As the standard error
of $\zeta$ Blakeman has proposed

$$\sigma_\zeta = 2\sqrt{\frac{\zeta}{N}}\,\sqrt{(1 - \eta^2)^2 - (1 - r^2)^2 + 1}$$


The use of this measure may be illustrated with reference 
to the problem relating to wheat yield which was con- 
sidered in an earlier chapter. For the relation between 
wheat yield and amount of nitrogen used as fertilizer, 
we had 

$$r = +.793$$
$$\eta = .964$$

Therefore

$$\zeta = \eta^2 - r^2 = .300$$

Inserting the given values in the formula

$$\sigma_\zeta = 2\sqrt{\frac{\zeta}{N}}\,\sqrt{(1 - \eta^2)^2 - (1 - r^2)^2 + 1}$$

and solving, we have

$$\sigma_\zeta = .074$$

With $\zeta$ having a value of .300, about 4.05 times its standard
error, .074, there can be no question as to the non-linearity
of the relationship. The difference between $\eta^2$ and $r^2$ is
one which could hardly be due to chance fluctuations of
sampling.
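The test may be sketched in Python as follows. This is an illustration under stated assumptions: the function `blakeman_se` implements the standard error of $\zeta$ in the form usually attributed to Blakeman, $2\sqrt{\zeta/N}\sqrt{(1-\eta^2)^2 - (1-r^2)^2 + 1}$, and, since the number of cases in the wheat-yield sample is not restated here, the ratio is computed from the standard error of .074 quoted in the text.

```python
import math


def blakeman_se(eta, r, n):
    """Standard error of zeta = eta^2 - r^2 (form attributed to Blakeman)."""
    zeta = eta ** 2 - r ** 2
    bracket = (1 - eta ** 2) ** 2 - (1 - r ** 2) ** 2 + 1
    return 2 * math.sqrt(zeta / n) * math.sqrt(bracket)


r, eta = 0.793, 0.964            # wheat yield and nitrogen, as quoted in the text
zeta = eta ** 2 - r ** 2
se_quoted = 0.074                # standard error as given in the text
ratio = zeta / se_quoted
print(round(zeta, 3), round(ratio, 2))
```

A ratio above 3 is taken, as elsewhere in this chapter, to indicate a difference that can hardly be attributed to sampling fluctuations.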

For the normal curve the standard error of $\chi$, the measure
of skewness, is derived from the formula

$$\sigma_\chi = \sqrt{\frac{3}{2N}}$$

For the distribution relating to telephone use given in the
preceding chapter, we have

$$\sigma_\chi = \sqrt{\frac{3}{1990}} = .03883$$

The value of $\chi$ was found to be -.05558, about 1.43 times
its standard error. This value of $\chi$, therefore, might well
be due to fluctuations of sampling, and does not prove
that the true distribution is other than normal.

The standard error of the quantity d, which measures 
the distance between the mean and the mode, may be de- 
termined, for the normal curve, from the formula 


$$\sigma_d = \sqrt{\frac{3}{2N}}\,\sigma$$

For the distribution of telephone calls this becomes

$$\sigma_d = \sqrt{\frac{3}{1990}} \times 147.65 = 5.733$$

Since d had a value of 8.21, less than one and one-half times its
standard error, we are led to conclude that the difference 
between the mean and the mode is one which might easily 
be due to chance fluctuations of sampling. If d were more 
than three times its standard error it would indicate a dif- 
ference between mean and mode which was probably not 
due to chance alone, but represented a fundamental depar- 
ture of the distribution from the normal type. 
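The two tests just described may be carried through in a few lines of Python. The sketch is illustrative; it uses the figures for the telephone distribution (N = 995, so that 2N = 1990, and $\sigma$ = 147.65), and the variable names are our own.

```python
import math

N = 995                  # telephone subscribers in the distribution
sigma = 147.65           # standard deviation in original units
chi = -0.05558           # skewness found in the preceding chapter
d = -8.21                # distance between mean and mode

se_chi = math.sqrt(3 / (2 * N))   # standard error of the skewness
se_d = se_chi * sigma             # standard error of d

ratio_chi = abs(chi) / se_chi
ratio_d = abs(d) / se_d
print(round(se_chi, 5), round(se_d, 3), round(ratio_chi, 2), round(ratio_d, 2))
```

Both ratios fall well short of 3, which is the basis for concluding that neither the skewness nor the distance between mean and mode is significant.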

A practical problem of some importance is frequently
encountered in comparing two averages. If two samples 
are taken and the averages computed, a question arises 
as to whether the difference between these samples might 
arise from mere fluctuations in sampling, or whether it 
indicates a real difference between the two larger groups 
from which the samples are drawn. The formula employed
in determining the standard error of the difference
between two means is

$$\sigma_D = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}$$

where $\sigma_1$ and $\sigma_2$ represent the standard deviations of the
two samples, while $N_1$ and $N_2$ represent the numbers in
the two samples. In using this measure, the actual differ- 
ence between two averages is compared with the standard 
error of this difference. If the actual difference is greater 
than three times the standard error it is highly unlikely 
that the difference is due to fluctuations of sampling. Such 
a condition would argue a fundamental difference between 
the two groups from which the samples were drawn. 
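The comparison of two averages may be illustrated by a short Python sketch. The two samples of weekly wages are hypothetical, the function name `se_difference` is our own, and the formula assumed is $\sqrt{\sigma_1^2/N_1 + \sigma_2^2/N_2}$, built from the standard deviations and numbers of the two samples as described above.

```python
import math


def se_difference(sigma1, n1, sigma2, n2):
    """Standard error of the difference between two sample means."""
    return math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)


# Hypothetical samples of weekly wages in two cities.
mean1, sigma1, n1 = 27.50, 2.00, 900
mean2, sigma2, n2 = 26.90, 2.40, 400

se = se_difference(sigma1, n1, sigma2, n2)
ratio = abs(mean1 - mean2) / se
significant = ratio > 3          # the rule of three standard errors
print(round(se, 4), round(ratio, 2), significant)
```

In this hypothetical case the difference exceeds three times its standard error, and would be taken to argue a real difference between the two groups.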

The formula for the standard deviation of sampling,
which may be used in comparing theoretical and actual
frequencies, is

$$\sigma = \sqrt{npq}$$


Its use has been explained in the preceding chapter. 

Most of the measures of reliability presented above have 
been given in terms of the standard error rather than of 
the probable error. The latter may be derived readily 
from the former, as explained, but the standard errors seem 
more appropriate for general use. The most elaborate 
tables of the probability integral are those in which abscissal 
distances are measured in terms of the standard deviation. 
The significance of any given standard error, in terms of 
probabilities, may be readily determined from such tables. 


LIMITATIONS TO MEASURES OF RELIABILITY


Such measures as those for which the formulas were 
given above should be used with a full understanding of 
their limitations. In the first place, their significance 
depends upon the presence of a sufficiently large number 
of cases in the sample. If N falls below 15 in any case the 
above formulas for standard errors should not be applied, 
while a much larger number of cases is advisable if much
weight is to be attached to the results. For the coefficient
of correlation 25 cases may be considered a minimum. The
assumption is made in the interpretation of these standard 
errors that statistical measures secured from successive 
samples would be distributed in accordance with the 
normal law of error. When the number of cases is large 
this is approximately true, even though the original data 
are not so distributed. But with a small number of cases 
this assumption may be quite invalid, and no precise 
significance, in terms of probabilities, may be attached 
to the standard error.

Moreover, and this is a most important warning, the 
standard errors can be assumed to measure only errors 
arising from the fluctuations of simple sampling. This 
assumption injects certain elements of doubt into any sta- 
tistical induction. We cannot, in the first place, be sure 
that the conditions of simple sampling, as given above, 
are actually fulfilled in a given case. They are rarely, if 
ever, perfectly fulfilled in the handling of economic data. 
Secondly, the standard errors as derived above can give 
no indication of the possibility of fluctuations in successive 
samples due to causes other than those arising from simple 
sampling. Fluctuations due to bias, due to the absence of 
random selection in the sampling process, due to per- 
sistent errors of any sort, quite elude this method of de- 
termining probable stability. These precautions must be 
constantly borne in mind in using these measures if the 
results of statistical study are to be properly interpreted 
and properly applied. 

So serious are these limitations to the employment of 
the usual measures of probable error in connection with 
economic data that it would seem generally advisable to 
subordinate such measures to actual statistical tests of 
stability. By the study of successive samples, and by the 
testing of the subordinate elements in a given sample when 
broken up into significant sub-groups, much more may be
learned as to the reliability of a given measure and as to
the possibility of applying it generally than by unquestion- 
ing acceptance and uncritical employment of the usual 
mathematical formulas for probable errors. 
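One form which such an actual test of stability might take is sketched below in Python. The data and the grouping by district are hypothetical, and the names are our own; the sketch merely breaks a sample into sub-groups and compares each sub-group mean with the mean of the whole.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical sample of 900 weekly wages, with a hypothetical grouping by district.
sample = [(rng.choice("ABC"), rng.gauss(27.50, 2.00)) for _ in range(900)]

overall = statistics.fmean(w for _, w in sample)
by_group = {}
for district, wage in sample:
    by_group.setdefault(district, []).append(wage)

for district in sorted(by_group):
    wages = by_group[district]
    group_mean = statistics.fmean(wages)
    print(district, len(wages), round(group_mean, 2), round(group_mean - overall, 2))
```

Where the sub-group means scatter widely about the general mean, the conditions of simple sampling are presumably not fulfilled, and the formal standard error understates the real uncertainty.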


REFERENCES 

Bowley, Arthur L. Elements of Statistics (312-342).

Broad, C. D. On the Relation between Induction and Probability.
Mind. N.S. Vol. 27, 1918, and Vol. 29, 1920.

Editorial. On the Probable Errors of Frequency Constants.
Biometrika. Vol. 2 (273-281).

Elderton, W. P. Frequency Curves and Correlation (131-138).

Fisher, Arne. The Mathematical Theory of Probabilities.

Kelley, Truman L. Statistical Method (94-108).

Keynes, J. M. A Treatise on Probability.

Mills, Frederick C. On Measurement in Economics (in Tugwell,
R. G. ed. The Trend of Economics, 37-70).

Pearl, Raymond. Medical Biometry and Statistics (209-219).

Pearson, Karl. The Fundamental Problem of Practical Statistics.
Biometrika. Vol. 13.

Rietz, H. L. Random Sampling (in Rietz, H. L. ed. Handbook of
Mathematical Statistics, 71-81).

Whittaker, E. T. and Robinson, G. The Calculus of Observations
(164-208).

Yule, G. U. An Introduction to the Theory of Statistics (254-290,
335-356).


APPENDIX A 


THE METHOD OF LEAST SQUARES AS APPLIED 
TO CERTAIN STATISTICAL PROBLEMS 


The method of least squares in the case of a single un- 
known quantity is merely a procedure for obtaining the 
most probable value of that quantity from a number of 
separate observations. The most probable value is that 
for which the sum of the squares of the deviations (or 
residuals) is a minimum. ‘This is the arithmetic mean of 
the observations. 
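That the arithmetic mean renders the sum of the squared residuals a minimum may be seen from a small numerical trial. The Python sketch below uses hypothetical measurements and a function name of our own.

```python
def sum_of_squares(observations, value):
    """Sum of squared deviations of the observations from a trial value."""
    return sum((x - value) ** 2 for x in observations)


observations = [10.2, 9.8, 10.5, 10.1, 9.9]      # hypothetical measurements
mean = sum(observations) / len(observations)

# Trial values on either side of the mean give a larger sum of squares.
for trial in (mean - 0.1, mean, mean + 0.1):
    print(round(trial, 2), round(sum_of_squares(observations, trial), 4))
```

Any trial value other than the mean increases the sum by N times the square of its distance from the mean.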

Where the measurements or observations do not relate 
directly to a single unknown quantity, but to functions of a 
number of unknown quantities, the problem is somewhat 
different. In the first case mentioned each observation is 
in the form of a single magnitude. In the present case 
each observation is in the form of an observation equation in
which the observed values of the variables, as found in com- 
bination, are entered. The unknown quantities are the 
constants which define the functional relationship between 
the variables in question. Our problem is that of finding 
the most probable values of these constants, the true values 
being unknown. 

As in the simpler case the most probable values are those 
for which the sum of the squares of the residuals is a mini- 
mum. In this case, however, the residuals are deviations, 
not from a single magnitude, as in the case of the arithmetic 
mean, but from the curve which describes the most probable 
functional relationship. In other words, the residuals are 
the differences between the computed and the actual values 
of the dependent variable. 



DERIVATION OF THE NORMAL EQUATIONS 


Representing by Y an observed value of the dependent
variable, by $Y_c$ the corresponding computed value, by v the
residual, or difference between Y and $Y_c$, and by $W_1$, $W_2$,
$W_3$ and $W_4$ different independent variables (or different
functions of a single independent variable), we may write

$$Y_c = f(W_1, W_2, W_3, W_4)$$
$$v = Y_c - Y = f(W_1, W_2, W_3, W_4) - Y$$
$$\sum(v^2) = \sum[f(W_1, W_2, W_3, W_4) - Y]^2$$

If the function in a particular case is of the type

$$Y_c = aW_1 + bW_2 + cW_3 + dW_4$$

we have

$$\sum(v^2) = \sum[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2$$


Our problem is that of determining the most probable 
values of the constants which define the function. These 
constants are represented, in the present case, by a, b, c
and d. (The W’s, it should be noted, refer to quantities 
which are known, once the observation equations are 
given. In the usual case the W’s are different functions 
of a single variable, but this is not essential.) On the 
assumption that the errors of observation are distributed 
in accordance with the normal law of error, it may be 
demonstrated that the most probable values of a, b, c 
and d, in the above equation, are those which render 
$\Sigma(v^2)$ a minimum; i.e.,


$$\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = \text{a minimum} \tag{A}$$


The normal equations necessary for the solution may be 
obtained by equating to zero the partial derivatives of 
the above expression with respect to the unknowns, a, b, 
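c and d. That is, we first differentiate the above function
with respect to a, holding b, c and d constant, then with
respect to b, holding a, c and d constant, then with respect
to c, holding a, b and d constant, and finally with respect
to d, holding a, b and c constant. Carrying through this
operation with respect to a, we have

$$\frac{\partial}{\partial a}\,\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or, dividing through by the constant factor 2,

$$\text{I} \qquad \Sigma W_1[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

Differentiating equation (A) now with respect to b, we have

$$\frac{\partial}{\partial b}\,\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{II} \qquad \Sigma W_2[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

Differentiating equation (A) with respect to c,

$$\frac{\partial}{\partial c}\,\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{III} \qquad \Sigma W_3[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

Differentiating equation (A) with respect to d,

$$\frac{\partial}{\partial d}\,\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = 0$$

or

$$\text{IV} \qquad \Sigma W_4[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$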


The most probable values of the quantities a, b,c and d 
are secured by solving simultaneously the four normal 
equations thus obtained (numbered above I, II, III, IV). 
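
As an illustration of the operation just described (a sketch
only, assuming the simplest case of two unknowns with
$W_1 = 1$ and $W_2 = X$, i.e., a straight line $Y_c = a + bX$),
the two partial derivatives give

$$\frac{\partial}{\partial a}\,\Sigma(a + bX - Y)^2 = 2\,\Sigma(a + bX - Y) = 0
\quad\Longrightarrow\quad Na + b\Sigma(X) = \Sigma(Y)$$

$$\frac{\partial}{\partial b}\,\Sigma(a + bX - Y)^2 = 2\,\Sigma X(a + bX - Y) = 0
\quad\Longrightarrow\quad a\Sigma(X) + b\Sigma(X^2) = \Sigma(XY)$$

These are the two normal equations for a straight line; the
factor 2 is common to every term and drops out when the
derivative is equated to zero.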


FORMATION OF THE NORMAL EQUATIONS


When the observation equations are all of the first 
degree (i.e., of the first degree with respect to the unknown 
quantities, a, b, c, etc.) the normal equations may be secured 
by the following process: 


1. Write the equation which describes the assumed relation- 
ship. The observation equations are derived by substituting in 
this equation the observed values of the variables, as found in 
combination. 




2. Multiply each observation equation by the coefficient of 
the first unknown in that equation; the sum of the resulting 
equations constitutes the first normal equation. 

3. Multiply each observation equation by the coefficient of 
the second unknown in that equation; the sum of the resulting
equations constitutes the second normal equation. 


Continue this process until normal equations equal in 
number to the unknown quantities are obtained. 

The actual process of forming the normal equations in 
curve fitting may be simplified, and the writing out of the 
separate observation equations avoided, as was demon- 
strated in earlier sections. The following may be laid 
down as general rules for the formation of the desired 
normal equations: 


1. Write the equation of the curve to be fitted. For the
purpose of this explanation we may employ the general form

$$Y = aW_1 + bW_2 + cW_3 + dW_4 + \dots \tag{1}$$

where Y represents the dependent variable, a, b, c, d . . .
represent the constants in the equation (the unknown quantities
in the present instance) and $W_1$, $W_2$, $W_3$, $W_4$ . . . represent
the coefficients of these unknowns. It is assumed that these
coefficients represent variables, and that term is used with
reference to them. Call this equation (1).

2. Multiply each term in equation (1) by the coefficient of
the first unknown in (1) (i.e., by $W_1$) and place the summation
sign, $\Sigma$, before each variable. This is the first normal
equation (I).

3. Multiply each term in equation (1) by the coefficient of
the second unknown (i.e., by $W_2$), and place the summation sign
before each variable. This is the second normal equation (II).

4. Multiply each term in equation (1) by the coefficient of
the third unknown (i.e., by $W_3$) and place the summation sign
before each variable. This is the third normal equation (III).

5. Multiply each term in equation (1) by the coefficient of
the fourth unknown (i.e., by $W_4$) and place the summation sign
before each variable. This is the fourth normal equation (IV).




The process may be continued until normal equations 
equal in number to the unknown quantities are obtained.¹


A STANDARD SET OF NORMAL EQUATIONS

As a set of generalized normal equations secured by the
above process and applying to any equation which can be
put in the form

$$Y = aW_1 + bW_2 + cW_3 + dW_4 + \dots$$

we have

$$\text{I} \qquad \Sigma(W_1Y) = a\Sigma(W_1^2) + b\Sigma(W_1W_2) + c\Sigma(W_1W_3) + d\Sigma(W_1W_4) + \dots$$

$$\text{II} \qquad \Sigma(W_2Y) = a\Sigma(W_1W_2) + b\Sigma(W_2^2) + c\Sigma(W_2W_3) + d\Sigma(W_2W_4) + \dots$$

$$\text{III} \qquad \Sigma(W_3Y) = a\Sigma(W_1W_3) + b\Sigma(W_2W_3) + c\Sigma(W_3^2) + d\Sigma(W_3W_4) + \dots$$

$$\text{IV} \qquad \Sigma(W_4Y) = a\Sigma(W_1W_4) + b\Sigma(W_2W_4) + c\Sigma(W_3W_4) + d\Sigma(W_4^2) + \dots$$

By substituting for $W_1$, $W_2$, $W_3$, $W_4$, etc., the particular
functions employed in a given case, these equations may
be readily adapted to any type of curve in the fitting of
which the method of least squares is applicable. Thus
in fitting a curve represented by the equation

$$Y = a + bX + cX^2 + dX^3$$

substitutions in the standard normal equations given above
are based upon the following relations:

$$W_1 = 1, \quad W_2 = X, \quad W_3 = X^2, \quad W_4 = X^3$$


The changes to be made in the normal equations are
obvious. $\Sigma(W_1Y)$ becomes $\Sigma(Y)$; $\Sigma(W_1^2)$ is equivalent
to $\Sigma(1^2)$, which is equal to N, the total number of
observations. The first normal equation becomes

$$\Sigma(Y) = Na + b\Sigma(X) + c\Sigma(X^2) + d\Sigma(X^3)$$

The other normal equations are modified correspondingly.

¹ These rules represent an adaptation of a similar series formulated by Raymond
Pearl in Medical Biometry and Statistics, 341.

In the example just given, the coefficients are all different 
functions of a single independent variable, X. It is not, of 
course, essential to the method of least squares that this 
be so. The coefficients, $W_1$, $W_2$, $W_3$, etc., may represent
a number of independent variables, as in the case of multiple 
correlation. 
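
The substitution of particular functions for the W's may be
sketched in code. The following is a minimal illustration, not
a procedure given in the text: the helper name
`normal_equation_sums` and the list `cubic_basis` are
hypothetical, and the X and Y values are borrowed from Table A
of this appendix purely as sample figures. It forms the sums
that appear in the standard normal equations for a cubic.

```python
# Hypothetical helper: builds the sums appearing in the standard
# normal equations for Y = a*W1 + b*W2 + c*W3 + d*W4.

def normal_equation_sums(xs, ys, basis):
    """Return (matrix, rhs): matrix[j][k] = sum(Wj*Wk), rhs[j] = sum(Wj*Y)."""
    rows = [[w(x) for w in basis] for x in xs]   # W values for each observation
    m = len(basis)
    matrix = [[sum(r[j] * r[k] for r in rows) for k in range(m)]
              for j in range(m)]
    rhs = [sum(r[j] * y for r, y in zip(rows, ys)) for j in range(m)]
    return matrix, rhs

# Cubic case: W1 = 1, W2 = X, W3 = X^2, W4 = X^3
cubic_basis = [lambda x: 1, lambda x: x, lambda x: x**2, lambda x: x**3]

xs = list(range(1, 10))                  # sample X values 1..9
ys = [2, 6, 7, 8, 10, 11, 11, 10, 9]     # sample Y values
matrix, rhs = normal_equation_sums(xs, ys, cubic_basis)

# First normal equation: sum(Y) = N*a + b*sum(X) + c*sum(X^2) + d*sum(X^3)
first_row = matrix[0]
print(rhs[0], "=", first_row)
```

The first row of `matrix` holds N, Σ(X), Σ(X²) and Σ(X³), in
agreement with the first normal equation as modified for the
cubic; the matrix is symmetrical about its principal diagonal.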

The limitations to the method of least squares must be 
borne in mind in making use of it. This method, in its 
direct application, is limited to cases in which the equation 
to the curve to be fitted is linear in the constants, i.e., the 
observation equations must all be linear as regards the 
unknown values, a, b, c, etc. (This does not mean, of course,
that the equation to the fitted curve must be linear.) As
an example of this limitation, we may cite a curve having
as equation $y = ab^x$, which cannot be fitted directly by the
method of least squares. If the observation equations are 
non-linear they may be reduced to the linear form in many 
instances by the use of logarithms, and the method of least 
squares then employed. 
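
The reduction by logarithms may be sketched as follows. This
is an illustration under stated assumptions, not a computation
from the text: the function name `fit_exponential` is
hypothetical and the data are made up so as to lie exactly on
the curve $y = 2 \cdot 3^x$. Taking logarithms gives
$\log y = \log a + x \log b$, which is linear in the unknowns
$\log a$ and $\log b$.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * b**x by least squares on log y = log a + x * log b."""
    n = len(xs)
    logs = [math.log10(y) for y in ys]
    sx, sl = sum(xs), sum(logs)
    sxx = sum(x * x for x in xs)
    sxl = sum(x * l for x, l in zip(xs, logs))
    # Normal equations: n*A + sx*B = sl ; sx*A + sxx*B = sxl,
    # where A = log a and B = log b.
    slope = (n * sxl - sx * sl) / (n * sxx - sx * sx)
    intercept = (sl - slope * sx) / n
    return 10 ** intercept, 10 ** slope

xs = [0, 1, 2, 3, 4]
ys = [2 * 3**x for x in xs]    # made-up data on y = 2 * 3**x
a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))
```

It should be borne in mind that a fit of this kind renders a
minimum the sum of the squares of the residuals of log y, not
of y itself.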


DERIVATION OF THE FORMULA FOR THE STANDARD ERROR
OF ESTIMATE

It has been pointed out in the body of the text that the 
standard error of estimate may be derived as a by-product 
of the method of least squares. A more complete demon- 
stration of this process may be given at this point. 

When the partial derivative of the expression

$$\Sigma[(aW_1 + bW_2 + cW_3 + dW_4) - Y]^2 = \text{a minimum}$$

is equated to zero, with respect to the first unknown,
a, we have

$$\Sigma W_1[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

Since

$$aW_1 + bW_2 + cW_3 + dW_4 - Y = v$$

we have as a necessary condition of fitting

$$\Sigma(vW_1) = 0$$

When the partial derivative of the same expression with
respect to b is equated to zero, we have

$$\Sigma W_2[(aW_1 + bW_2 + cW_3 + dW_4) - Y] = 0$$

or, making the same substitution as in the preceding case,

$$\Sigma(vW_2) = 0$$

Repeating the operation with respect to c and d, we may
show that

$$\Sigma(vW_3) = 0$$

and

$$\Sigma(vW_4) = 0$$

In summary: When the method of least squares is
employed in determining the most probable values of
certain unknown quantities, having as known coefficients
the quantities $W_1$, $W_2$, $W_3$, $W_4$, the following relations
hold as a necessary condition of the least squares method:

$$\Sigma(vW_1) = 0, \quad \Sigma(vW_2) = 0, \quad \Sigma(vW_3) = 0, \quad \Sigma(vW_4) = 0$$


A knowledge of these relationships gives us a method of
securing readily the value $\Sigma(v^2)$ and the standard error of
estimate. Assume that, by the method of least squares, we
have determined the constants in an equation of the type

$$Y = aW_1 + bW_2 + cW_3 + dW_4$$

For each residual we have the relation

$$v = aW_1 + bW_2 + cW_3 + dW_4 - Y \tag{1}$$

Multiplying throughout by v, and summing, we have

$$\Sigma(v^2) = a\Sigma(vW_1) + b\Sigma(vW_2) + c\Sigma(vW_3) + d\Sigma(vW_4) - \Sigma(Yv) \tag{2}$$

But

$$\Sigma(vW_1) = 0, \quad \Sigma(vW_2) = 0, \quad \Sigma(vW_3) = 0, \quad \Sigma(vW_4) = 0$$

therefore,

$$\Sigma(v^2) = -\Sigma(Yv) \tag{3}$$

Multiplying each equation (1) throughout by Y, and
adding, we have

$$\Sigma(Yv) = a\Sigma(W_1Y) + b\Sigma(W_2Y) + c\Sigma(W_3Y) + d\Sigma(W_4Y) - \Sigma(Y^2) \tag{4}$$

Substituting in (3) the equivalent of $\Sigma(Yv)$ we have

$$\Sigma(v^2) = \Sigma(Y^2) - a\Sigma(W_1Y) - b\Sigma(W_2Y) - c\Sigma(W_3Y) - d\Sigma(W_4Y) - \dots \tag{5}$$


This gives us a means of obtaining the value 2(v?) with- 
out computing the separate residuals, a method which is 
applicable whenever the equation of the curve to be fitted 
is of the form, or may be reduced, by the use of logarithms, 
reciprocals or other manipulation, to the form 

$$Y = aW_1 + bW_2 + cW_3 + dW_4$$

In applying this to a particular case it is necessary only to
replace $W_1$, $W_2$, $W_3$, $W_4$, etc., by the functions which actually
appear as coefficients of the unknown quantities in the
original equation. Thus in fitting a curve the equation
to which is

$$Y = a + bX + cX^2 + dX^3$$

we find, as noted above, that

$$W_1 = 1, \quad W_2 = X, \quad W_3 = X^2, \quad W_4 = X^3$$

Making these substitutions in equation (5) above we have

$$\Sigma(v^2) = \Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y) - d\Sigma(X^3Y) \tag{6}$$
The standard error, $S_y$, is derived from the equation¹

$$S_y^2 = \frac{\Sigma(d^2)}{N}$$

where d is used to represent a deviation from a fitted curve.
The deviation, d, then, is but another term for the residual
v. Accordingly, as a general expression for the standard
error of Y, with $W_1$, $W_2$, $W_3$, and $W_4$ as independent
variables, we have

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(W_1Y) - b\Sigma(W_2Y) - c\Sigma(W_3Y) - d\Sigma(W_4Y)}{N} \tag{7}$$

As in the previous case, this may be applied to a particular
problem by replacing $W_1$, $W_2$, $W_3$, $W_4$, etc., by the actual
coefficients of the unknown quantities.

¹ Since our object is to measure the actual "scatter" about the fitted curve,
the formula $S_y^2 = \Sigma(d^2)/N$ is used, rather than the formula
$S_y^2 = \Sigma(d^2)/(N - N_c)$ (where N represents the number of observations
and $N_c$ the number of constants in the equation to the fitted curve). The
second formula would be used, in accordance with the theory of least squares,
if we were seeking to determine the mean square error of an observation or of
an observational equation.


DERIVATION OF THE FORMULA FOR THE INDEX OF
CORRELATION


We have adopted as an index of the degree of correlation
between two variables the measure $\rho$ (Rho), derived from
the equation

$$\rho_{yx}^2 = 1 - \frac{S_y^2}{\sigma_y^2} \tag{8}$$

assuming a single dependent variable, Y, and a single
independent variable, X. With a single dependent variable, Y,
and a number of independent variables, $W_1$, $W_2$, $W_3$, $W_4$, the
expression might be written

$$\rho^2_{y \cdot w_1 w_2 w_3 w_4} = 1 - \frac{S_y^2}{\sigma_y^2} \tag{9}$$

Corresponding changes would be made in the subscripts
for other changes in the symbols employed. The expression
above is equivalent to

$$\rho^2_{y \cdot w_1 w_2 w_3 w_4} = 1 - \frac{\Sigma(d^2)/N}{\Sigma(y^2)/N} = 1 - \frac{\Sigma(d^2)}{\Sigma(y^2)}$$

where y represents a deviation from an origin at the mean
of the Y's. But

$$\Sigma(y^2) = \Sigma(Y^2) - Nc_y^2$$

where Y represents the original values of the Y-variable
and $c_y$ represents the difference between the original origin
and the mean of the Y's. (The symbols $c_y$ and $c_x$ should not
be confused with c, one of the constants in the equation
to the fitted curve.)
Accordingly, we have

$$\rho^2_{y \cdot w_1 w_2 w_3 w_4} = 1 - \frac{\Sigma(d^2)}{\Sigma(Y^2) - Nc_y^2} \tag{10}$$


But we have secured an expression for $\Sigma(v^2)$ (the equivalent
of $\Sigma(d^2)$) which holds in the case of a curve fitted by
the method of least squares. Taking first the general
case, with a number of independent variables, and
substituting the equivalent of $\Sigma(d^2)$ in the above equation,
we have

$$\rho^2_{y \cdot w_1 w_2 w_3 w_4} = 1 - \frac{\Sigma(Y^2) - a\Sigma(W_1Y) - b\Sigma(W_2Y) - c\Sigma(W_3Y) - d\Sigma(W_4Y) - \dots}{\Sigma(Y^2) - Nc_y^2}$$

Simplifying this, we have, as a general formula for the
index of correlation

$$\rho^2_{y \cdot w_1 w_2 w_3 w_4} = \frac{a\Sigma(W_1Y) + b\Sigma(W_2Y) + c\Sigma(W_3Y) + d\Sigma(W_4Y) + \dots - Nc_y^2}{\Sigma(Y^2) - Nc_y^2} \tag{11}$$




This may be applied to a specific case by replacing $W_1$,
$W_2$, $W_3$, $W_4$, etc., in the above formula by the functions
which appear as coefficients of the unknown quantities in
the original equation. When all these are functions of
a single independent variable, as in the usual case, the
index of correlation would be represented by the symbol $\rho_{yx}$.


CERTAIN SPECIAL CASES 


In the case of multiple correlation, where the symbols 
$X_1$, $X_2$, $X_3$, $X_4$, etc., are used to represent all the variables,
whether considered dependent or independent, the symbol 
R is employed for the measure of correlation and numerical 
subscripts utilized as described in the body of the text. 

In the case of a straight line relationship between two 
variables, p is replaced by the symbol r, which represents 
the ordinary coefficient of correlation. As the general 
equation for r we have 

$$r^2 = \frac{a\Sigma(Y) + b\Sigma(XY) - Nc_y^2}{\Sigma(Y^2) - Nc_y^2}$$

There are two special cases in which this formula may be
simplified. If the origin be at the mean of the X's, we
have

$$Nc_y^2 = a\Sigma(Y)$$

and the formula for r reduces to

$$r^2 = \frac{b\Sigma(xY)}{\Sigma(Y^2) - Nc_y^2}$$

If the origin be at the mean of the Y's (it is not essential
that it be also at the mean of the X's)

$$\Sigma(y) = 0, \text{ and } c_y = 0$$

and the formula for the coefficient of correlation becomes

$$r^2 = \frac{b\Sigma(Xy)}{\Sigma(y^2)}$$

In this latter case the general formula for $\rho$ may also be
simplified by the elimination of the terms $a\Sigma(y)$ and $Nc_y^2$.
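
The general equation for r may be verified numerically. The
sketch below is an illustration only: it assumes the nine
points used later in this appendix in fitting a line, solves
the two normal equations for a straight line directly, and
compares the value of r² obtained from the general equation
with that obtained from the separately computed residuals.
The variable names are hypothetical.

```python
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, 4, 6, 5, 10, 9, 10, 12, 11]
n = len(xs)

sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
syy = sum(y * y for y in ys)

# Solve the two normal equations for the line Y = a + bX.
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n

# The origin is at zero, so c_y is simply the mean of the Y's.
c_y = sy / n

# r^2 from the general equation, using only sums and the constants.
r2_formula = (a * sy + b * sxy - n * c_y**2) / (syy - n * c_y**2)

# r^2 from the residuals v = computed value minus actual value.
sum_v2 = sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))
r2_residuals = 1 - sum_v2 / (syy - n * c_y**2)

print(round(r2_formula, 4), round(r2_residuals, 4))
```

The two values agree, as they must if the constants satisfy the
normal equations; a disagreement would point to an
arithmetical error in the fitting.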




CHECKS ON THE FORMATION OF THE NORMAL 
EQUATIONS 


There are so many possibilities of arithmetical error in 
the formation and solution of a set of normal equations 
that checks should be employed wherever possible. A 
convenient check on the calculations leading to the normal 
equations is afforded by the introduction in each observa- 
tion equation of an additional term, s, equal to the sum of 
all the known quantities in that equation. Thus, in the 
following system of observation equations, formed in fitting 
a line to the points 1, 3; 2, 4; 3, 6; 4, 5; 5, 10; 6, 9; 7, 10;
8, 12; 9, 11, the values of s are as indicated:


     Observation equation      s
          3 = a + 1b           5
          4 = a + 2b           7
          6 = a + 3b          10
          5 = a + 4b          10
         10 = a + 5b          16
          9 = a + 6b          16
         10 = a + 7b          18
         12 = a + 8b          21
         11 = a + 9b          21


(The coefficient of a in each case is 1, and this is added to
the other known quantities.)

In fitting a curve described by the type equation

$$Y = aW_1 + bW_2 + cW_3 + dW_4$$

the following relations prevail between s and the other
quantities computed. For each observation equation,

$$Y + W_1 + W_2 + W_3 + W_4 = s$$

For the normal equations,

$$\Sigma(W_1Y) + \Sigma(W_1^2) + \Sigma(W_1W_2) + \Sigma(W_1W_3) + \Sigma(W_1W_4) = \Sigma(W_1s)$$

$$\Sigma(W_2Y) + \Sigma(W_1W_2) + \Sigma(W_2^2) + \Sigma(W_2W_3) + \Sigma(W_2W_4) = \Sigma(W_2s)$$

$$\Sigma(W_3Y) + \Sigma(W_1W_3) + \Sigma(W_2W_3) + \Sigma(W_3^2) + \Sigma(W_3W_4) = \Sigma(W_3s)$$

$$\Sigma(W_4Y) + \Sigma(W_1W_4) + \Sigma(W_2W_4) + \Sigma(W_3W_4) + \Sigma(W_4^2) = \Sigma(W_4s)$$


This form is capable of application to any specific problem. 
In each case the s-equations are formed in precisely the 
same way as the corresponding normal equations. 

In applying these checks several additional columns are 
needed in the working tables, but the extra trouble is more 
than compensated by the opportunity to check the work 
at each stage. The application is illustrated in the following 
working table, showing the calculations involved in fitting 
a second degree parabola of the form 

$$Y = a + bX + cX^2$$

to the nine points 1, 2; 2, 6; 3, 7; 4, 8; 5, 10; 6, 11; 7, 11;
8, 10; 9, 9.


TABLE A
Illustrating the Use of Checks on the Formation of Normal
Equations

   Y     X    X²     XY    X²Y      s     Xs      X²s
   2     1     1      2      2      5      5        5
   6     2     4     12     24     13     26       52
   7     3     9     21     63     20     60      180
   8     4    16     32    128     29    116      464
  10     5    25     50    250     41    205    1,025
  11     6    36     66    396     54    324    1,944
  11     7    49     77    539     68    476    3,332
  10     8    64     80    640     83    664    5,312
   9     9    81     81    729    100    900    8,100
 ----------------------------------------------------
  74    45   285    421  2,771    413  2,776   20,414


(Columns for X³ and X⁴ are omitted, as the values Σ(X³) and Σ(X⁴) may be
derived from prepared tables.)


Each of the values in the column headed s is secured 
from the corresponding observation equation. Thus, from 
the first observation equation 


$$2 = 1a + 1b + 1c$$

we have 5 as the value of s (2, plus the coefficients of the
three constants). These values of s are secured readily
from the table by adding the figures in the columns headed
Y, X and X², plus 1, the coefficient of the constant term a.

Adding the various columns, the arithmetic work is
verified by the following checks:

$$\Sigma(Y) + N + \Sigma(X) + \Sigma(X^2) = \Sigma(s)$$

$$74 + 9 + 45 + 285 = 413$$

$$\Sigma(XY) + \Sigma(X) + \Sigma(X^2) + \Sigma(X^3) = \Sigma(Xs)$$

$$421 + 45 + 285 + 2{,}025 = 2{,}776$$

$$\Sigma(X^2Y) + \Sigma(X^2) + \Sigma(X^3) + \Sigma(X^4) = \Sigma(X^2s)$$

$$2{,}771 + 285 + 2{,}025 + 15{,}333 = 20{,}414$$


Further uses of a check of this kind are explained below, 
in discussing the solution of the normal equations. 
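
The three checks above lend themselves to mechanical
computation. The following sketch is offered as an
illustration only (the variable names are hypothetical); it
rebuilds the s column of Table A from the nine points and
compares the two sides of each check.

```python
xs = list(range(1, 10))
ys = [2, 6, 7, 8, 10, 11, 11, 10, 9]
n = len(xs)

# s = Y plus the coefficients of a, b and c in each observation equation.
s = [y + 1 + x + x**2 for x, y in zip(xs, ys)]

sum_x = {p: sum(x**p for x in xs) for p in range(1, 5)}   # sums of X .. X^4
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2y = sum(x**2 * y for x, y in zip(xs, ys))

# Each pair is (sum of the separate terms, sum computed from the s column).
checks = [
    (sum_y + n + sum_x[1] + sum_x[2],
     sum(s)),
    (sum_xy + sum_x[1] + sum_x[2] + sum_x[3],
     sum(x * si for x, si in zip(xs, s))),
    (sum_x2y + sum_x[2] + sum_x[3] + sum_x[4],
     sum(x**2 * si for x, si in zip(xs, s))),
]
for left, right in checks:
    print(left, right, "check" if left == right else "ERROR")
```

Agreement in all three lines verifies the sums which enter the
normal equations; it does not, of course, verify the original
entries of X and Y.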


OTHER TESTS


The possibility of checking the calculations in other 
ways has been suggested in the preceding sections. Thus, 
where the coefficients of the constants in the equation to 
the fitted curve are represented by $W_1$, $W_2$, $W_3$, $W_4$, we
know that

$$\Sigma(vW_1) = 0, \quad \Sigma(vW_2) = 0, \quad \Sigma(vW_3) = 0, \quad \Sigma(vW_4) = 0$$

If a curve of the type

$$Y = a + bX + cX^2 + dX^3$$

has been fitted, this means that

$$\Sigma(v) = 0, \quad \Sigma(vX) = 0, \quad \Sigma(vX^2) = 0, \quad \Sigma(vX^3) = 0$$


The accuracy of the work may be tested by checking 
these relations. 

Finally, we may test the accuracy of the work by com- 
puting the standard error of estimate in two different ways. 
We may compute the separate residuals by taking the differ- 
ence between computed and actual values of the dependent 
variable, and from these values determine S. This may be 
compared with the results secured by applying the general 
formula for the standard error, as derived above. In the 
fitting of the second degree parabola, the data of which 
were used to illustrate the method of checking the normal 
equations, the equation derived was 


$$Y = -.92860 + 3.52316X - .267316X^2$$

From the residuals separately computed we have

$$S_y = .4941$$

From the formula

$$S_y^2 = \frac{\Sigma(Y^2) - a\Sigma(Y) - b\Sigma(XY) - c\Sigma(X^2Y)}{N}$$

we have

$S_y$ = [value illegible in source]


This constitutes a final check upon the accuracy of the 
calculations. 
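
The tests described in this section may be sketched in code.
The following is an illustration under stated assumptions, not
the computation shown in the text: it fits the second degree
parabola to the nine points of Table A by solving the normal
equations with a small elimination routine (the helper `solve`
is hypothetical), then applies the residual tests and computes
the standard error in the two ways described.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(aug[r][i]))
        aug[i], aug[p] = aug[p], aug[i]
        for r in range(i + 1, n):
            f = aug[r][i] / aug[i][i]
            for c in range(i, n + 1):
                aug[r][c] -= f * aug[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

xs = list(range(1, 10))
ys = [2, 6, 7, 8, 10, 11, 11, 10, 9]
n = len(xs)

# W1 = 1, W2 = X, W3 = X^2 for each observation.
W = [[1, x, x * x] for x in xs]
matrix = [[sum(w[j] * w[k] for w in W) for k in range(3)] for j in range(3)]
rhs = [sum(w[j] * y for w, y in zip(W, ys)) for j in range(3)]
a, b, c = solve(matrix, rhs)

# Residuals: computed value minus actual value.
v = [a + b * x + c * x * x - y for x, y in zip(xs, ys)]
tests = [sum(v),
         sum(vi * x for vi, x in zip(v, xs)),
         sum(vi * x * x for vi, x in zip(v, xs))]

# Standard error from the separate residuals ...
s_direct = (sum(vi * vi for vi in v) / n) ** 0.5
# ... and from the sums alone, without computing the residuals.
sum_v2 = (sum(y * y for y in ys) - a * rhs[0] - b * rhs[1] - c * rhs[2])
s_formula = (sum_v2 / n) ** 0.5

print([round(t, 8) for t in tests], round(s_direct, 4), round(s_formula, 4))
```

When the constants are carried to the full precision of the
machine the two values of the standard error agree, at about
.494. Constants rounded to a few decimal places, as in hand
computation, will give slightly different results by the two
routes.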


SOLUTION OF THE NORMAL EQUATIONS 


The task of solving the normal equations is not a difficult
one in most of the cases presented to the economic statis- 
tician. If there are only two or three unknowns the cor- 
responding number of normal equations may be solved 
by simple algebraic methods. Even with three equations, 
however, it is advisable to employ a systematic procedure, 
and with more than three equations this is imperative. 
Such systematic methods of solving the simultaneous equa- 
tions which are met with in connection with the method of 
least squares have been worked out by Gauss and by
Doolittle. The latter method, which is perhaps the more
convenient for general usage, is demonstrated below. 

The coefficients of the unknowns in the normal equa- 
tions are always symmetrical with respect to the principal 
diagonal. Thus in securing the most probable values of 
the constants in the equation 


$$Y = aW_1 + bW_2 + cW_3 + dW_4$$

we have the four normal equations

$$\begin{aligned}
a\Sigma(W_1^2) + b\Sigma(W_1W_2) + c\Sigma(W_1W_3) + d\Sigma(W_1W_4) - \Sigma(W_1Y) &= 0 \\
a\Sigma(W_1W_2) + b\Sigma(W_2^2) + c\Sigma(W_2W_3) + d\Sigma(W_2W_4) - \Sigma(W_2Y) &= 0 \\
a\Sigma(W_1W_3) + b\Sigma(W_2W_3) + c\Sigma(W_3^2) + d\Sigma(W_3W_4) - \Sigma(W_3Y) &= 0 \\
a\Sigma(W_1W_4) + b\Sigma(W_2W_4) + c\Sigma(W_3W_4) + d\Sigma(W_4^2) - \Sigma(W_4Y) &= 0
\end{aligned}$$

The symmetrical arrangement about the diagonal, when
Y-terms are neglected, is obvious. Starting with any term
on the principal diagonal, we have the same coefficients
directly above as to the left. Thus above the diagonal
term in which the coefficient $\Sigma(W_3^2)$ appears we have the
coefficients $\Sigma(W_2W_3)$ and $\Sigma(W_1W_3)$. The same coefficients
are found to the left of the given diagonal term, and on
the same line. The terms to the left of each diagonal
entry may be omitted, therefore, and we may write the
normal equations in the form

$$\begin{array}{rrrrl}
a\Sigma(W_1^2) & +\; b\Sigma(W_1W_2) & +\; c\Sigma(W_1W_3) & +\; d\Sigma(W_1W_4) & -\; \Sigma(W_1Y) = 0 \\
 & +\; b\Sigma(W_2^2) & +\; c\Sigma(W_2W_3) & +\; d\Sigma(W_2W_4) & -\; \Sigma(W_2Y) = 0 \\
 & & +\; c\Sigma(W_3^2) & +\; d\Sigma(W_3W_4) & -\; \Sigma(W_3Y) = 0 \\
 & & & +\; d\Sigma(W_4^2) & -\; \Sigma(W_4Y) = 0
\end{array}$$


THE DOOLITTLE METHOD


The Doolittle method may be illustrated with reference
to the normal equations given on page 494 of the text. In
the abbreviated form given above these are

$$\begin{array}{rrrl}
8.3564\,b_{12.34} & +\; 2.790\,b_{13.24} & +\; 2.932\,b_{14.23} & +\; 47.967 = 0 \\
 & +\; 6.6645\,b_{13.24} & +\; 2.063\,b_{14.23} & +\; 62.039 = 0 \\
 & & +\; 7.7893\,b_{14.23} & +\; 47.519 = 0
\end{array}$$

We wish to solve these for the constants. For convenience
of reference let us represent the constant $b_{12.34}$ by A, $b_{13.24}$
by B, and $b_{14.23}$ by C. All the work of computation, with
the necessary checks, is shown in the following table:


TABLE B
Solution of Normal Equations by the Doolittle Method

Line          (1)         (2)         (3)         (4)         (5)         (6)
      Reciprocals           A           B           C                       s
I                      8.3564       2.790       2.932      47.967     62.0454
II                                 6.6645       2.063      62.039     73.5565
III                                            7.7893      47.519     60.3033
1                      8.3564       2.790       2.932      47.967     62.0454
2      -.11966876   -1.000000    -.333876    -.350869   -5.740151   -7.424896  check
3                                  6.6645       2.063      62.039     73.5565
4                                -.931514    -.978924  -16.015030  -20.715470
5                                5.732986    1.084076   46.023970   52.841030  check
6      -.17442917               -1.000000    -.189094   -8.027923   -9.217017  check
7                                              7.7893      47.519     60.3033
8                                           -1.028748  -16.830133  -21.769807
9                                            -.204992   -8.702857   -9.991922
10                                           6.555560   21.986010   28.541571  check
11     -.15254227                           -1.000000   -3.353796   -4.353796  check

Back Solution

             C             B             A
     -3.353796     -8.027923     -5.740151
                    +.634183     +2.468592
     -3.353796     -7.393740     +1.176743
                                 -2.094816

A = $b_{12.34}$ = -2.094816
B = $b_{13.24}$ = -7.393740
C = $b_{14.23}$ = -3.353796

Check:


Equation I:

$$8.3564\,b_{12.34} + 2.790\,b_{13.24} + 2.932\,b_{14.23} = -47.967$$

Substituting the given values,

$$8.3564\,(-2.094816) + 2.790\,(-7.393740) + 2.932\,(-3.353796) = -47.966985$$




Explanation. — The coefficients of the unknown quan- 
tities, A, B, and C, are listed in the designated columns. 
The known term in each normal equation is listed in 
column (5). (The sign of this known term, it should be 
noted, is that which it would have when the entire ex- 
pression, of which it is one term, is equated to zero.) 
Column s is employed as a check. The value in column s,
in each of the lines I, II and III, is the algebraic sum of 
the known values in the given normal equation. In securing 
this sum the coefficients to the left of the diagonal, which 
have been omitted from the table as it stands, must be 
included. 

The following is a summary of the procedure in solving 
the normal equations: 


1. In line (1) write normal equation I. 

2. In line (2), column (1), write the reciprocal of the value 
in line (1), column (2), with sign changed. (This is the reciprocal 
of the coefficient of A.) Multiply each item in line (1) by this 
reciprocal, entering the products in the corresponding columns in 
line (2). (The algebraic sum of the items in columns (2), (3), 
(4) and (5) of line (2) should equal the value in column (6).) 
This operation has eliminated the unknown A, by expressing it
in terms of B and C. (The —1 in line (2), column (2), has 
been included only to facilitate the checking process. The same 
is true in lines (6) and (11).) A heavy line may be drawn across 
the table below line (2). 

3. Write normal equation II in line (3).

4. Multiply by the coefficient of B in line (2) (i.e., -.333876)
the items in columns (3), (4), (5) and (6) in line (1). Enter
the products in the corresponding columns of line (4).

5. Add lines (3) and (4), entering the sums in line (5). (The 
algebraic sum of the items in columns (3), (4) and (5) of line
(5) should equal the value in column (6).) 

6. In column (1), line (6), enter the reciprocal of the value
in column (3), line (5), reversing the sign. Multiply each term
in line (5) by this reciprocal, entering the products in line (6).
(The sum of the items in columns (3), (4) and (5) of line (6)
should equal the value in column (6).) This operation has
eliminated the unknown B, by expressing it in terms of C. A
heavy line may be drawn across the table below line (6).

7. Write normal equation III in line (7). 

8. Multiply by the coefficient of C in line (2) (i.e., —.350869) 
the items in columns (4), (5), and (6) of line (1). Enter the 
products in the corresponding columns of line (8).

9. Multiply by the coefficient of C in line (6) (i.e., -.189094)
the items in columns (4), (5) and (6) of line (5). Enter the 
products in the corresponding columns of line (9). 

10. Add lines (7), (8) and (9), entering the sums in line (10). 
(The algebraic sum of the items in columns (4) and (5) of line 
(10) should equal the value in column (6).) 

11. In column (1), line (11), enter the reciprocal of the value 
in column (4) of line (10), reversing the sign. Multiply each 
term in line (10) by this reciprocal, entering the products in 
line (11). (The algebraic sum of the items in columns (4) and 
(5) of line (11) should equal the value in column (6).) This
operation gives the value of C, which is found in column (5) 
of line (11). A heavy line may be drawn across the table below 
line (11). 


Were there additional unknowns, as D and E, this last 
operation would have given C as a function of D and E 
and it would be necessary to carry the process still further, 
repeating the steps taken above. The next operation would 
be to bring down the fourth normal equation, entering it 
in line (12). Then the coefficients of D in lines (2), (6) 
and (11) would be used to multiply the necessary items 
in lines (1), (5) and (10), the products being entered in lines 
(13), (14) and (15). The sum of the items in lines (12),
(13), (14) and (15) would be entered in line (16) and checked 
by the item in the s column. Multiplying through by the 
reciprocal of the coefficient of D in line (16), with sign re- 
versed, the value of D would be obtained in terms of E. 
The value of E would be derived in a similar fashion.

The checks on these various operations have been indicated
in the table. The testing of the results at each step
reduces the possibility of error to a minimum.

The back solution presents no difficulties. We have, 
from line (11), 


$$C = -3.353796$$

from line (6)

$$B = -.189094C - 8.027923$$

from line (2)

$$A = -.333876B - .350869C - 5.740151$$


(The items in column (6) are inserted merely as checks. 
The items — 1.000000 which appear in lines (2), (6) and 
(11) are inserted to assist in the checking.) 

The computations involved in the back solution appear 
in the table. 

A final check is afforded by inserting the values secured 
by this process in one of the normal equations. This check, 
as carried out for equation I, is shown below the table. 
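
The forward and back solutions described above may be
sketched in code. This is an illustration only, written to
follow the steps of the summary; the function name
`doolittle_solve` and its arguments are hypothetical, and the
check column s is omitted for brevity. The known terms carry
the sign they have when each equation is equated to zero.

```python
def doolittle_solve(upper, known):
    """Solve sum_j M[i][j] * x[j] + known[i] = 0 for a symmetrical M.

    upper[i] lists the coefficients of normal equation i from the
    principal diagonal rightward (terms left of the diagonal omitted).
    """
    n = len(known)
    reduced = []   # lines (1), (5), (10): equations after elimination
    scaled = []    # lines (2), (6), (11): multiplied by -1/diagonal
    for i in range(n):
        # Bring down normal equation i, padded with zeros at the left.
        line = [0.0] * i + list(upper[i]) + [known[i]]
        # Add the products formed from each earlier pair of lines.
        for k in range(i):
            factor = scaled[k][i]          # coefficient of unknown i
            for c in range(i, n + 1):
                line[c] += factor * reduced[k][c]
        reduced.append(line)
        recip = -1.0 / line[i]             # reciprocal, sign reversed
        scaled.append([recip * value for value in line])
    # Back solution, beginning with the last unknown.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = scaled[i][n] + sum(scaled[i][c] * x[c]
                                  for c in range(i + 1, n))
    return x

upper = [[8.3564, 2.790, 2.932], [6.6645, 2.063], [7.7893]]
known = [47.967, 62.039, 47.519]
A, B, C = doolittle_solve(upper, known)

# Final check: substitute in normal equation I.
residual_I = 8.3564 * A + 2.790 * B + 2.932 * C + 47.967
print(round(A, 6), round(B, 6), round(C, 6))
```

The values so obtained should agree with those of the table to
within the rounding of the hand computation, which carries six
decimal places at each step.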




APPENDIX B 
GLOSSARY OF SYMBOLS 


The following are the most important symbols employed 
in the preceding pages. A given symbol is sometimes called 
upon to serve different purposes, but the precise meaning 
should be clear from the context. 


1. General symbols for variables and constants: 


x: a variable quantity. 

y: a variable quantity. 
In general, any letter near the end of the alphabet may 
be employed to represent a variable quantity. Different 
variable quantities may be represented by the use of a 
single symbol, with different subscripts, as X1, X2, X3, or 
W1, W2, W3. [A distinction is later drawn (cf. Symbols
employed in the measurement of relationship) between
capital letters and small letters, as used to represent
variable quantities.]

a: a constant (i.e., a quantity the value of which does not 
change in the given discussion). In general, any letter 
near the beginning of the alphabet may be used to 
represent a constant. 


2. Symbols employed in the analysis and description of 
the frequency distribution: 


m: the value of an individual observation; the value of the 
mid-point of a class. (The symbols a1, a2, a3 are sometimes
employed to represent different observations in
a series.) 

f: the number of observations in a given class; the frequency 
of a given class. 

a: the class-interval. 
l: the lower limit of a class.


N: the total number of cases in a given series or frequency 
distribution. 

d: the deviation of a given observation from an average; 
usually, a deviation from the arithmetic mean. When 
written with a subscript, as dx or dy, it refers to a deviation
from the arithmetic mean of the variable represented
by the subscript. The symbol d is sometimes used to 
designate the difference between mean and mode. 

d’: the deviation of a given observation from an arbitrary 
origin, or assumed mean. 

c: the difference between an arbitrary origin, or assumed 
mean, and the true mean (in terms of the symbols ex- 
plained below, c = M — M’). 

Σ (Sigma): the symbol for the process of summation. Thus
Σd means the sum of all the deviations.

W1, W2, W3: weights attached to a series of measures being 
averaged. (Not to be confused with similar symbols used 
to represent different variable quantities.) 

Yo: the maximum ordinate of a frequency curve. 


Symbols for averages, quartiles, etc.: 

M: the arithmetic mean. 
Md.: the median. 
Mo.: the mode. 
Mg: the geometric mean.
H: the harmonic mean.
M': the value of an assumed arithmetic mean.
Q1: the first or lower quartile.
Q2: the second quartile or median.
Q3: the third or upper quartile.
K: the value of a point midway between the first and third
quartiles.
D3: the third decile.


Symbols for measures of variation and skewness: 


M.D.: the mean deviation. 
σ: the standard deviation; the root-mean-square deviation
about the arithmetic mean.

s: the root-mean-square deviation about an origin other than
the arithmetic mean.

P.E.: the probable error.

Q.D.: the quartile deviation.

q1: the difference between the median and the lower quartile
(Md. - Q1).

q2: the difference between the upper quartile and the median
(Q3 - Md.).

V: the coefficient of variation.

sk: a measure of skewness.

χ (Chi): a measure of skewness based upon the criteria
β1 and β2. (These criteria are explained below.)


3. Symbols relating to index numbers. 


p0: price of a given commodity at time "0" (the base period).

q0: quantity of same commodity at time "0".

p1: price of same commodity at time "1".

q1: quantity of same commodity at time "1".

p'0: price of a second commodity at time "0".

q'0: quantity of second commodity at time "0".

p'1: price of second commodity at time "1".

q'1: quantity of second commodity at time "1".

p1/p0: a price relative (relation of price of a given commodity
at time "1" to price of same commodity at time "0").

q1/q0: a quantity relative.

P0: price level at time "0".

P1: price level at time "1".


4. Symbols employed in the measurement of relationship. 


X: an observed value of a variable quantity.

Y: an observed value of a variable quantity. (The observed
values of different variables may be represented also by
the symbols $X_1$, $X_2$, $X_3$, or $W_1$, $W_2$, $W_3$.)

$\bar{X}$: the arithmetic mean of a number of observed values of
the variable X. A similar symbol may be employed for
other variables. (In one demonstration in the preceding
pages, relating to multiple correlation, the symbols $A_1$,
$A_2$, $A_3$ ... are used to represent the arithmetic means
of the variables $X_1$, $X_2$, $X_3$ ... The symbols $M_x$
and $M_y$ are occasionally employed to designate the
arithmetic means of different variables.)

x: value of a variable quantity expressed as a deviation
from the arithmetic mean of all the observed values.
The symbol y and the symbols $x_1$, $x_2$, $x_3$ ... are
similarly employed with respect to variables represented,
as to original observations, by the symbols Y, $X_1$, $X_2$,
$X_3$ ...

x’: a value of a variable quantity expressed as a deviation 
from an arbitrary origin. The symbol y’ has a similar 
meaning. 

Y,: the computed or estimated value of a variable, as 
determined from an equation of average relationship; 
the symbol y, may be employed for such a computed 
value, expressed as a deviation from the mean. 

p: the mean product of two variables when expressed as
deviations from their respective arithmetic means, i.e.,
$p = \frac{\Sigma(xy)}{N}$. When written with subscripts, as $p_{12}$,
the latter relate to the variables in question, as $x_1$, $x_2$.

p’: the mean product of two variables when expressed as
deviations from assumed arithmetic means, i.e., $p' = \frac{\Sigma(x'y')}{N}$.
r: the Pearsonian coefficient of correlation. When written
with subscripts, the latter indicate the variables to which
the coefficient relates. Thus $r_{yx}$ refers to the variables
y and x, and $r_{12}$ refers to the variables $x_1$ and $x_2$.
$\rho$ (Rho): a general index of correlation (not to be confused
with Spearman’s coefficient of correlation based upon the
squares of differences in rank). Subscripts should be employed
to indicate the variables to which the measure
relates, as $\rho_{yx}$, $\rho_{xy}$, $\rho_{\log y \cdot x}$, $\rho_{\log y \cdot \log x}$, etc. (In each
case the first subscript relates to the dependent variable.)
d: the deviation of a given observation from a fitted curve; 
the difference between an observed and a corresponding 
computed value of a variable. 
v: a residual; identical in meaning with d, as given above. 




S: the root-mean-square deviation about a fitted curve;
the standard error of estimate. This measure should
be written with a subscript to indicate the variable to
which it applies, as $S_y$, $S_x$, $S_{\log y}$ (the standard error of
estimate in terms of logarithms), $S_r$ (the standard error
of estimate in terms of ratios), $S_{1/y}$ (the standard error of
estimate in terms of reciprocals).

$\eta$ (Eta): the correlation ratio. Subscripts should be employed
to represent the variables to which the measure
relates, as $\eta_{yx}$ or $\eta_{xy}$. The first subscript in each case
relates to the dependent variable.

$\sigma_{ay}$: the root-mean-square deviation about a line through the
means of the columns of a correlation table; the standard
deviation of the y-arrays about their respective means.
The symbol $\sigma_{ax}$ has the same meaning with respect to
the rows of a correlation table, or the x-arrays.

$\sigma_{my}$: the standard deviation of the means of the columns of
a correlation table about the mean of all the y’s, the
mean of each column being weighted by the number of
items in that column. The symbol $\sigma_{mx}$ has the same
meaning with respect to the means of the rows.

$\zeta$ (Zeta): the test for linearity of regression ($\zeta = \eta^2 - r^2$).

Kk: the number of arrays employed in the computation of a 
given correlation ratio. 

b: the coefficient of regression; the slope of a line of regression.
When written with subscripts, the latter relate
to the variables in question, as $b_{yx}$, $b_{12}$ (for the variables
$x_1$, $x_2$). The first subscript relates to the dependent
variable in each case; $b_{yx}$ is the coefficient of regression
of y on x and $b_{xy}$ is the coefficient of regression of x on y.

$R_{1.234}$: the coefficient of multiple correlation between a
dependent variable, $x_1$, and a combination of independent
variables, $x_2$, $x_3$, and $x_4$. The order may be changed,
but the primary subscript always relates to the dependent
variable.

$r_{12.34}$: the coefficient of partial or net correlation between
the variables $x_1$ and $x_2$, when the variables $x_3$ and $x_4$ are
held constant. The order of subscripts is changed for a
different combination of variables, the two primary
subscripts always relating to the variables between which
the net correlation is being measured.

$b_{12.34}$: the coefficient of net regression between the variables
$x_1$ and $x_2$, the former being dependent, when the variables
$x_3$ and $x_4$ are also taken account of in the estimating
equation; the weight given to $x_2$ in estimating $x_1$, when
the estimate is also based upon values of $x_3$ and $x_4$.
The order of subscripts is changed for a different combination
of variables.

$S_{1.234}$: the root-mean-square deviation about a line describing
the relationship between a dependent variable, $x_1$, and
a series of independent variables, $x_2$, $x_3$, and $x_4$; the
standard error of estimate of $x_1$ under these conditions.

$\sigma_{1.234}$: the standard deviation of the fourth order; identical
with $S_{1.234}$.


(In all the measures immediately above, the number of subscripts corresponds 
to the number of variables included in a given study. For the sake of simplicity, 
only four variables have been assumed.) 
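Several of the two-variable measures defined in this section can be computed together, as in the Python sketch below. It is an illustration under stated assumptions rather than a procedure quoted from the text: the function name and data are hypothetical, r is obtained from the product-moment relation r = p / (σx σy), the regression coefficients are taken as b_yx = p / σx² and b_xy = p / σy², and the standard error of estimate about the straight line of regression is taken as S_y = σy √(1 − r²), all with N as the divisor.

```python
from math import sqrt

def linear_measures(xs, ys):
    """Mean product, correlation, regression slopes and S_y for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations x from the mean of X
    dy = [y - mean_y for y in ys]  # deviations y from the mean of Y
    p = sum(a * b for a, b in zip(dx, dy)) / n  # mean product
    sigma_x = sqrt(sum(a * a for a in dx) / n)
    sigma_y = sqrt(sum(b * b for b in dy) / n)
    r = p / (sigma_x * sigma_y)
    return {
        "p": p,
        "r": r,
        "b_yx": p / sigma_x ** 2,  # regression of y on x
        "b_xy": p / sigma_y ** 2,  # regression of x on y
        "S_y": sigma_y * sqrt(1 - r * r),  # assumes a straight line of regression
    }

# Hypothetical paired observations.
measures = linear_measures([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(measures)
```

One check on the arithmetic is that the product of the two regression coefficients equals r².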


5. Symbols employed in the measurement of errors. 


$\sigma_M$: the standard error of the arithmetic mean.

$\sigma_\sigma$: the standard error of the standard deviation. Similarly,
the symbol $\sigma$, with any subscript, represents the standard
error of the measure to which the subscript relates.

$P.E._M$: the probable error of the arithmetic mean ($P.E._M$
= .67449 $\sigma_M$). Similarly, the symbol P.E. with any subscript
represents the probable error of the measure to
which the subscript relates.
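The relation between a probable error and the corresponding standard error can be sketched in Python as follows. The factor .67449 is the one given above; the formula used for the standard error of the mean (the standard deviation divided by the square root of the number of observations) is the usual one and is an assumption here, as are the function names and the sample data.

```python
from math import sqrt

PROBABLE_ERROR_FACTOR = 0.67449  # factor stated in the glossary entry above

def standard_error_of_mean(values):
    """Standard deviation (divisor N) over the square root of N; assumed formula."""
    n = len(values)
    mean = sum(values) / n
    sigma = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sigma / sqrt(n)

def probable_error(standard_error):
    """Probable error of any measure, given that measure's standard error."""
    return PROBABLE_ERROR_FACTOR * standard_error

# Hypothetical sample of eight observations.
se_mean = standard_error_of_mean([2, 4, 4, 4, 5, 5, 7, 9])
pe_mean = probable_error(se_mean)
print(se_mean, pe_mean)
```

The same `probable_error` helper applies to any other measure once its standard error is known.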


6. Other symbols. 


p: the probability of the successful outcome of a given event. 

q: the probability of an unsuccessful outcome of a given 
event. 

: the sum of the known quantities in an observation equa- 
tion; used in checking the formation and solution of 
the normal equations. 

$\Delta$ (Delta): a general symbol for the difference between two
magnitudes or between successive values of a variable
quantity. The variable to which it refers may be represented
by a subscript, as $\Delta_x$. First differences are
represented by $\Delta^1$, second differences by $\Delta^2$, etc.

$\nu_2$, $\nu_3$, etc.: moments of a frequency distribution about
an arbitrary origin.

2, 13, etc.: uncorrected moments of a frequency dis- 
tribution about the arithmetic mean. 

$\mu_2$, $\mu_3$, etc.: moments of a frequency distribution about
the arithmetic mean after the application of Sheppard’s
corrections. (If Sheppard’s corrections are not necessary,
these symbols would be employed for the uncorrected
moments about the mean.)


$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$

$\beta_2 = \frac{\mu_4}{\mu_2^2}$

$\kappa$: A criterion of curve type based on $\beta_1$ and $\beta_2$.

$\chi^2$: A quantity used in testing goodness of fit.
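The moments about the arithmetic mean, and the criteria built from them, can be sketched in Python as below. The sketch assumes Pearson's usual definitions, β1 = μ3² / μ2³ and β2 = μ4 / μ2², applies no Sheppard's corrections (the data are treated as ungrouped), and uses hypothetical function names and figures.

```python
def central_moment(values, k):
    """k-th moment about the arithmetic mean, with N as divisor (uncorrected)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n

def beta_criteria(values):
    """Return (beta_1, beta_2) under the assumed Pearsonian definitions."""
    mu2 = central_moment(values, 2)
    mu3 = central_moment(values, 3)
    mu4 = central_moment(values, 4)
    return mu3 ** 2 / mu2 ** 3, mu4 / mu2 ** 2

# Hypothetical symmetrical series, so the third moment vanishes.
beta_1, beta_2 = beta_criteria([1, 2, 3, 4, 5])
print(beta_1, beta_2)
```

A symmetrical series gives β1 = 0; for the normal curve β2 = 3, so the value found here (1.7) marks a flatter distribution.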




LIST OF REFERENCES 


Adams, Oscar S. Application of the Theory of Least Squares
to the Adjustment of Triangulation. Special Publication
No. 28. U.S. Coast and Geodetic Survey. 1915.

Barlow. Tables of Squares, Cubes, Square Roots, Cube Roots,
Reciprocals. Spon and Chamberlain. New York. 1919.

Beckett, S. H., and Robertson, R. D. The Economical Irrigation
of Alfalfa in the Sacramento Valley. Bulletin 280,
Agricultural Experiment Station. University of California.
May, 1917.

Bowley, A. L. Elements of Statistics. P. S. King and Son,
London. 1920. 

Brinton, W. C. Graphic Methods for Presenting Facts. The 
Engineering Magazine Co., New York. 1914. 

Broad, C. D. On the Relation between Induction and Probability.
Mind N.S. Vol. 27, 1918, Vol. 29, 1920.

Brunt, David. The Combination of Observations. Cambridge
University Press. 1917.

Chaddock, R. E. Principles and Methods of Statistics. Houghton
Mifflin, Boston. (In preparation.)

Clark, Wallace. The Gantt Chart. Ronald Press, New York.
1922. 

Crum, W. S. The Use of the Median in Determining Seasonal 
Variation. Journal of the American Statistical Association. 
March, 1923. 

Davenport, E. Comparative Agriculture. In Bailey’s Cyclo- 
pedia of American Agriculture. 

Davies, G. R. Introduction to Economic Statistics. Century, 
New York. 1922. 

Day, Edmund E. An Index of the Physical Volume of Production.
Review of Economic Statistics. Sept., 1920, Jan., 1921.
Standardization of the Construction of Statistical Tables.
Quarterly Publications of the American Statistical Association.
March, 1920. The Volume of Production of Basic
Raw Materials in the United States. Review of Economic
Statistics. July, 1922.

Edgeworth, F. Y. On Correlated Averages. Phil. Mag., 5th
series. Vol. 34. 1892.

Editorial. On the Probable Errors of Frequency Constants.
Biometrika, Vol. 2 (273-281).

Elderton, W. Palin. Frequency Curves and Correlation. Layton,
London. 1906.

Ezekiel, M. J. B. A Method of Handling Curvilinear Correlation
for Any Number of Variables. Journal of the American
Statistical Association, Dec., 1924.

Faulkner, Helen D. The Measurement of Seasonal Variation.
Journal of the American Statistical Association, June, 1924.

Field, J. H. Some Advantages of the Logarithmic Scale in Statistical
Diagrams. Journal of Political Economy. Oct., 1917.

Fisher, Arne. An Elementary Treatise on Frequency Curves.
Macmillan, New York. 1922. The Mathematical Theory of
Probabilities. Macmillan, New York. 1922.

Fisher, Irving. A Weekly Index Number of Wholesale Prices.
Journal of the American Statistical Association. Sept., 1923.
The Making of Index Numbers. Houghton Mifflin, Boston.
1922. The “Ratio” Chart. Quarterly Publications of the
American Statistical Association. June, 1917. Revision of
the Weekly Index Number. Journal of the American Statistical
Association, Sept., 1924.

Flux, A. W. The Measurement of Price Changes. Journal of
the Royal Statistical Society. March, 1921.

Galton, Francis. Correlations and Their Measurement. Proceedings
of the Royal Society. Vol. 45, 1888.

Griffin, F. L. Introduction To Mathematical Analysis. Houghton
Mifflin, Boston. 1922.

Haas, G. C. Sale Prices as a Basis for Farm Land Appraisal.
Technical Bulletin, No. 9. University of Minnesota Agricultural
Experiment Station. Nov., 1922.

Hall, Lincoln W. Seasonal Variation as a Relative of Secular
Trend. Journal of the American Statistical Association,
June, 1924.

Hart, W. L. The Method of Monthly Means for Determination
of a Seasonal Variation. Journal of the American Statistical
Association. Sept., 1922.

Haskell, A. C. Graphic Charts in Business. Codex Book Co.,
New York. 1922.

How to Make and Use Graphic Charts. Codex Book Co., New
York. 1919.

Jones, D. C. A First Course in Statistics. Bell, London. 1921.

Karsten, Karl. Charts and Graphs. Prentice Hall, New
York. 1923.

Kelley, Truman L. Statistical Method. Macmillan, New
York. 1923.

Keynes, J. M. A Treatise on Probability. Macmillan, New
York. 1921.

Killough, H. B. A Statistical Analysis of Oat Prices. Bureau of
Agricultural Economics.

King, W. I. Elements of Statistical Method. Macmillan, New
York. 1912. An Improved Method for Measuring the Seasonal
Factor. Journal of the American Statistical Association,
Sept., 1924.

Knibbs, G. H. The Theory and Justification of Curve Smoothing.
In H. Secrist, Readings and Problems in Statistical Methods.
Macmillan, New York. 1920.

Kurtz, Edwin. Replacement Insurance. Administration, July,
1921.

Lipka, Joseph. Graphical and Mechanical Computation. Wiley,
New York. 1918.

Mellor, J. W. Higher Mathematics for Students of Chemistry
and Physics. Longmans, London. 1922.

Merriman, Mansfield. The Method of Least Squares. Wiley,
New York. 1897.

Mills, Frederick C. On Measurement in Economics (in Tugwell,
R. G. ed., The Trend of Economics. Knopf, New York. 1924).

Miner, J. R. Tables of $\sqrt{1 - r^2}$ and $1 - r^2$ for use in Partial
Correlation and Trigonometry. Johns Hopkins Press, Baltimore.
1922.

Mitchell, W. C. History of Prices During the War. Price
Bulletin, No. 1, War Industries Board. 1919. The Making
and Using of Index Numbers. Part I. Bulletin No. 284.
U. S. Bureau of Labor Statistics. Oct., 1921.


Moore, H. L. Economic Cycles: Their Law and Cause.
Macmillan, New York. 1914. Elasticity of Demand and
Flexibility of Prices. Journal of the American Statistical
Association. March, 1922. Empirical Laws of Demand and
Supply and the Flexibility of Prices. Political Science
Quarterly. Dec., 1919. Forecasting the Yield and the Price
of Cotton. Macmillan, New York. 1917. Generating
Economic Cycles. Macmillan, New York. 1923.

National Bureau of Economic Research. Income in the
United States. (Edited by W. C. Mitchell.) Harcourt
Brace and Co., New York. 1921.

North Dakota Agricultural College. Cost of Production
and Farm Organization on 126 Farms in North Dakota.
Bulletin No. 165, Agricultural Experiment Station. 1922.

Ogburn, W. F., and Thomas, Dorothy. Influence of the Business
Cycle on Certain Social Conditions. Quarterly Publications
of the American Statistical Association. Sept., 1922.

Peake, E. G. An Academic Study of Some Money Market and
Other Statistics. P. S. King, London. 1923.

Pearl, R., and Reed, L. J. Predicted Growth of Population of
New York and Its Environs. Committee of Plan of New
York. 1923.

Pearl, Raymond. Medical Biometry and Statistics. Saunders,
Philadelphia. 1923.

Pearson, Karl. Mathematical Contributions to the Theory of
Evolution. On the General Theory of Skew Correlation and
Non-Linear Regression. Draper’s Company Research Memoirs.
Cambridge University Press. 1905. Notes on the
History of Correlation. Biometrika, Vol. 13. On a Correction
Needful in the Case of the Correlation Ratio. Biometrika,
Vol. 8. On the Correction Necessary for the Correlation Ratio.
Biometrika, Vol. 14. Regression, Heredity and Panmixia.
Phil. Transactions, Royal Society, Series A, Vol. 187. 1896.
Tables for Statisticians and Biometricians. Cambridge University
Press. 1914. The Fundamental Problem of Practical
Statistics. Biometrika, Vol. 13.

Persons, Warren M., and Coyle, Eunice. A Commodity
Price Index of Business Cycles. Review of Economic Statistics.
Prel. Vol. 3.




Persons, Warren M. An Index of Trade for the United States.
Review of Economic Statistics. April, 1923. Correlation 
of Time Series. Journal of the American Statistical Associa- 
tion. June, 1923. Fisher's Formula for Index Numbers. 
Review of Economic Statistics. Prel. Vol. 3. Indices of 
Business Conditions. Review of Economic Statistics. Prel. 
Vol. 1. 1919. The Variate Difference Correlation Method 
and Curve Fitting. Quarterly Publications of the American 
Statistical Association. June, 1917. 

Prescott, Raymonp. Law of Growth in Forecasting Demand. 
Journal of the American Statistical Association. Dec., 1922. 

Rietz, H. L. ed. Handbook of Mathematical Statistics. Houghton 
Mifflin, Boston. 1924. 

Rietz, H. L., and Crathorne, A. R. College Algebra. Holt,
New York. 1917.

Rugg, H. O. Statistical Methods Applied to Education. Houghton
Mifflin, Boston. 1917.

Running, T. R. Empirical Formulas. Wiley, New York. 1917.

Schultze, Arthur. Graphic Algebra. Macmillan, New York.
1918.

Schultz, Henry. The Statistical Measurement of the Elasticity
of Demand for Beef. Journal of Farm Economics. July, 1924.

Secrist, Horace. Introduction to Statistical Methods. Macmillan,
New York. 1917. Readings and Problems in Statistical
Methods. Macmillan, New York. 1920.

Sheppard, W. F. On the Calculation of the Most Probable Values
of Frequency Constants for Data Arranged According to
Equi-distant Divisions of a Scale. Proceedings of the London
Mathematical Society, Vol. 29. 1898. The Calculation
of the Moments of a Frequency Distribution. Biometrika,
Vol. 5.

Smith, Bradford B. The Use of Punched Card Tabulating
Equipment in Multiple Correlation Problems. (Prepared for
the use of statisticians of the Bureau of Agricultural Economics.
U.S. Dept. of Agriculture.) 1923.

Snow, E. C. Trade Forecasting and Prices. Journal of the
Royal Statistical Society. May, 1923.

Snyder, Carl. A New Index of the Volume of Trade. Journal
of the American Statistical Association. Dec., 1923.




Stamp, J. C. The Effect of Trade Fluctuations upon Profits.
Journal of the Royal Statistical Society. July, 1918.

Steinmetz, C. P. Engineering Mathematics. McGraw Hill,
New York. 1917.

Stewart, Ethelbert. Labor Efficiency and Productiveness in
Sawmills. Monthly Labor Review. Jan., 1923.

Tolley, H. R., and Ezekiel, M. J. B. A Method of Handling
Multiple Correlation Problems. Journal of the American
Statistical Association. Dec., 1923.

Walsh, C. M. The Measurement of General Exchange Value.
Macmillan, New York. 1901. The Problem of Estimation.
P. S. King and Son, London. 1921.

Weld, L. D. Theory of Errors and Least Squares. Macmillan,
New York. 1916.

West, Carl J. Introduction to Mathematical Statistics. Adams,
Columbus. 1918.

Whipple, G. C. Vital Statistics. Wiley, New York. 1919.

Whittaker, E. T., and Robinson, G. The Calculus of Observations.
Blackie and Son, London. 1924.

Whitehead, A. N. An Introduction to Mathematics. Holt,
New York. 1911.

Working, Holbrook. Factors Determining the Price of Potatoes
in St. Paul and Minneapolis. Technical Bulletin, No. 10,
University of Minnesota Experiment Station. Oct., 1922.

Wright, T. W., and Hayford, J. F. Adjustment of Observations.
Van Nostrand, New York. 1906.

Young, Allyn A. The Measurement of Changes in the General
Price Level. Quarterly Journal of Economics. Aug., 1921.

Yule, G. Udny. An Introduction to the Theory of Statistics.
Griffin, London. 1919. On the Time Correlation Problem,
with Especial Reference to the Variate Difference Correlation
Method. Journal of the Royal Statistical Society. July,
1921.

Zizek, Franz. Statistical Averages. Holt, New York. 1913.


INDEX 


Abscissa, 12 

Acme Corporation, sales, 40, 41, 42 

Administration, 3, 4 

Aggregates of actual prices, simple, 
193, 194, 195; weighted, 207, 208 

Aggregative construction of index 
numbers of prices, 193 

Agricultural products, 228 

Agriculture. See U.S. Department of 
Agriculture 

Alfalfa yield and irrigation, 433, 435, 
436 

Allocation of costs, 7 

American index numbers of wholesale 
prices, 229; Bradstreet’s, 234, 243; 
Dun’s, 236, 243; Federal Reserve 
Board, 231, 243; Fisher’s weekly 
index, 237, 243; Persons’ commod- 
ity price index of business cycles, 
238, 243; U.S. Bureau of Labor 
Statistics, 229, 242; Various Sys- 
tems, 241; War Industries Board, 
233, 243 

American Telephone and Telegraph 
Co., 308, 309; composite index of 
general business activity, 357, 358; 
composite index correlated with stock 
price index, 422; general business 
curve, 356; study of annual message 
use of residence subscribers, 527, 528, 
535, 536, 543 

Analysis, 7 

Animal and forest products, 228 

Annalist’s food price index, 241, 243 

Anti-logarithms, 28 

Arithmetic averages of relative prices, 
195, 196; of egg prices, 316 

Arithmetic mean, 112, 145, 453, 454; 
characteristic features, 144; compu- 
tation of, 113; short method of 
computing, 115, 118; weighted, 115 

Arithmetic measures, geometric and 
harmonic compared with, 475 


Arithmetic series, 18 

Array of data, 63, 64 

Artillery firing, 100, 101, 102 

Astronomical observations, 99 

Asymmetry. See Skewness 

Automobile speed, 141 

Average relationship, 366, 367, 375 

Averages, 97, 109, 111; arithmetic, 195, 
196, 216; characteristic features of 
the chief averages, 144; criteria for 
use of, 140; in distribution of price 
changes, 187; references, 146; rela- 
tions between different averages, 143; 
symbols, 583; weighted average, 115; 
when significant, 148. See also Mean; 
Moving averages 


Bank clearings in New York City, 
measurement of trend, 259, 260, 261, 
268 

Bank discount rates, relations, 379, 
380, 381, 382, 384, 385, 391, 395, 
396, 397 

Barlow’s Tables, 142 

Beckett, S. H., 433 

Binomial expansion, 520 

Bituminous coal production, 333; actual 
production, 337, 339, 340; changes in 
rate of growth, 335, 336; computation 
of cycles, 336, 337; corrected devia- 
tions from normal, together with ac- 
tual monthly, 338; measurement of 
cyclical changes, 334 

Blakeman, John, 557 

Bonds, 125, 127 

Bowley, A. L., 167 

Bradley, 99 

Bradstreet’s index numbers of whole- 
sale prices, 173, 195, 222, 226, 253, 
307; description, 234, 243 

Brinton, W. C., 51 

Buffalo, N. Y., 527 

Building contracts, 307; actual and deflated
values, 313; A. T. & T. Co.
deflated values, 308, 309

Business, 3; A. T. & T. Co. curve and 
composite index, 356, 357, 358; 
classes of activity, 3; quantitative 
character of its problems, 6 

Business cycles, 8, 223; Persons’ com- 
modity price index, 238; stock price 
cycles as related to, 421, 423, 424, 
425, 426 

Business failures, 284; determining the 
secular trend, 284, 285, 288; second 
degree parabola and, 286; straight 
line and, 286; third degree parabola 
and, 287; with three lines of trend, 
289 


Catastrophes, 258 

Central tendency, 107, 148; geometric 
mean as a measure of, 139; measures 
of, 109 

Chain relatives, 319 

Chaining index numbers, 218 

Chance, 516, 518; comparison of actual 
and theoretical frequencies, 522. See 
also Probabilities 

Characteristic, 27 

Charlier, C. V. L., 157 

Charts, 36; comparison of frequencies, 
44; component parts, 45; construc- 
tion, 36; cumulative, 46, 47; Gantt 
progress, 48, 49, 50, 51; logarithmic 
and semi-logarithmic, 32, 39; selec- 
tion, 36; time series, 37 

Chi-square test of goodness of fit of 
frequency curve, 543 

Classes, 70; boundaries, 71; determina- 
tion, 70; indeterminate, 72 

Classification of quantitative data, 64 

Class-intervals, 68; uniformity, 71 

Class limits, 69 

Coal. See Bituminous coal production 

Coal and petroleum production, 171; 
index numbers, 170, 171; weighted 
index numbers, 172, 173 

Coefficient of correlation, 372, 375, 
391; coefficient of regression and, 
503; effect of departure from normal, 
406; measurement of time sequence 
and, 420; product-moment formula, 
385; relation to correlation ratio, 450 

Coefficient of multiple correlation, 496 




Coefficient of regression, 394; coeffi- 
cient of correlation and, 503 

Coefficient of variation, 165 

Coin tossing, 101, 102, 103 

Column diagrams, 74, 75, 76, 77, 83, 
84, 85 

Commodities, change in prices, 104; 
number to be used in constructing 
index numbers, 222 

Commodity price index of business
cycles, 238

Comparability of data, 253 

Comparison of frequencies, 44 

Comparison of frequency distributions, 
97 

Component parts, 45 

Compound interest, 34, 35; computing, 
138 

Concentration, 109; tendency toward, 
111 

Condensation of data, 7, 67 

Constants, 14; symbols for, 14, 582 

Continuous series, 85 

Control, 5 

Coordinate geometry, 11 

Coördinates, 11, 13; on diagrams, 55;
rectangular, 11 

Corn yield and temperature, 485, 486, 
510, 511; three independent variables 
and, 490 

Correlation, linear, 363; multiple and 
partial, 485; non-linear, 432; partial 
and simple distinguished, 502; partial 
or net, 500; terms used in measure- 
ment, 393. See also Linear corre- 
lation; Multiple correlation, etc. 

Correlation ratio, 441; computation, 
442, 447; correction, 449; relation 
to coefficient of correlation, 450 

Correlation table, 379; construction, 
378; discount rates of Federal Re- 
serve banks and of commercial banks, 
381; tabulation of items, 380 

Cost accounting, 7 

Cost of living, index numbers, 245; 
wages in relation to, 250 

Cotton, 13, 15; consumption, corrections
for seasonal variation, 354;
domestic consumption, index numbers,
169; production and prices,
computing coefficient of correlation,
428; production and prices, correlation
of cyclical fluctuations, 413, 414,
415, 416

Coyle, E. S., 239 

Cumulative arrangement of data, 88 

Cumulative charts, 46 

Cumulative frequency curve, 91, 92; 
weekly earnings of employees, 134 

Cumulative tables, 90 

Curve of error. See Normal curve of 
error 

Curves, 16, 20; on diagrams, 55; fac- 
tors determining choice, 290; hyper- 
bolic, 20; normal, criteria, 531; 
parabolic, 20; potential series, fitting, 
280; representation of secular trend 
by, 271; scale, reading, 56; sine, 20, 
25, 26; smoothing, 80, 87; three 
families for relations of variables, 
472; types, 20, 26. See also Fitting 
of curves 

Cycles. See Business cycles 

Cyclical fluctuations, 258; correlation 
between, 412; “lead” and “lag” in
correlation, 420; measurement, 315, 
332; references, 343 


Data, comparability, 253; condensa- 
tion, 67; cumulative arrangement, 
88; homogeneity, 129; organization, 
61; plotting, 300; quantitative, 6, 
7; unorganized, 62 

Dates on diagrams, 54 

Davenport, D. H., 331, 461 

Davis, Cal., 433 

Day, E. E., 205, 347, 348, 350, 351 

Deciles, 124; graphic location, 131, 
133, 134 

Deflating index, 308, 309, 312

Deflation, 307 

Depreciation, 90 

Descartes, René, 11 

Deviation, mean, 149, 150, 151; quar- 
tile, 158; standard, 154, 156, 162, 
163, 368, 405, 513 

Diagrams, 52 

Dice-throwing experiment, 523, 554; 
actual and theoretical probabilities 
compared, 524, 525 

Discount rates, 122 

Discrete series, 85, 87 

Dispersion, 147, 160. See also Variation


Dispersion, zone of, in artillery firing, 
100 

Dodge, F. W., & Co., 307 

Doolittle method, 577, 578 

Dun’s index of wholesale prices, 236, 
243 


Economic data, 6; curves for, 297, 300;
curves and histograms, 106; distribution,
103; plotting, 300

Economics, 3, 9; quantitative charac- 
ter of its problems, 6 

Edgeworth, F. Y., 523 

Egg prices, 315; arithmetic averages, 
316; index numbers of seasonal vari- 
ation based on percentage ratios 
to trend, 326; indices of seasonal 
variation, adjusted, 327; link rela- 
tives, 318, 319; moving averages, 
321, 322 

Elderton, W. P., 544 

Elimination of irrelevant factors, 501 

Empirical curves, 283, 284 

Employees, distribution on the basis of 
weekly earnings, 65, 72, 74 

Error, probable. See Probable error 

Error, standard. See Standard error 

Errors, symbols employed in measure- 
ment of, 587 

Estimates, 370; assumptions involved, 
453; problem of estimation, 453; 
references, 484; zones of estimate, 
458; zones of estimate and their sig- 
nificance, 477 

Exchange rates, 105, 106, 149, 151; 
London-Paris, 156, 164, 165 

Expense items, 7 

Exponent, 27 

Exponential curve, 20, 23, 24 

Extrapolation, 304 

Ezekiel, M. J. B., 491, 499 


Factor reversal test, 212, 213 

Farm crop prices, 191; comparison of 
five types of index numbers, 201, 202; 
comparison of weighted index num- 
bers, 216, 217; index numbers, 191, 
192, 194; weighting of index num- 
bers, 205 

Farm products, index numbers of price 
and buying power, 246, 247 

Farms in New England, 44, 45 




Fechner, G. T., 125 

Federal Reserve bank discount rates, 
relation to commercial bank rates, 
379, 380, 381, 382, 384, 385, 391, 395, 
396, 397 

Federal Reserve Bank, New York, 223; 
deflation index, 312; index of trade 
volume, 359, 361; wholesale price 
index, 241, 243 

Federal Reserve Board, index numbers 
of wholesale prices, 231, 243; index 
of production in basic industries, 
352, 353; other index numbers, 
355 

First order coefficients of correlation, 
507; computation, 509, 510 

Fisher, Arne, 545 

Fisher, Irving, factor reversal test, 212, 
213; “ideal” index, 215, 217, 219,
346; new weekly index number of
wholesale prices, 237, 243; substitute
formula for “ideal” index, 218; time
reversal test, 203, 204, 212; types of 
index numbers, 193; weighting 
methods for index numbers, 209 

Fitting a straight line, 273; for nine 
points, 273; special cases, 278 

Fitting of curves, 271; normal curve, 
532, 536; potential series, 280; short 
cut methods, 298; testing goodness 
of fit of normal, 535, 540 

Five per cent bonds, frequency distri- 
bution, 125 

Fluctuations, 257; cyclical, measure- 
ment, 315; random, 258; seasonal, 
measurement, 315; short term, 427. 
See also Business cycles; Cyclical 
fluctuations 

Food price indexes, 241; U.S. Bureau of 
Labor Statistics index of retail prices, 
244 

Frequencies, comparison, 44; determi- 
nation of theoretical, 537, 540 

Frequency curves, 45, 545; cumulative, 
91; distribution of errors of astro- 
nomical observation, 99; distribution 
of soldiers by height, 98; ogive and, 
relation, 93, 94; personal income re- 
cipients, 86; smooth, 81; weekly 
earnings of employees, 132 

Frequency distributions, 61; comparison,
97; description, 97, 147; description,
note on, 545; employees
on the basis of weekly earnings, 65, 
72, 74; general characteristics, 107; 
graphic representation, 74; message 
use of residence (telephone) sub- 
scribers, illustrating computation of 
moments, 527, 528; methods of 
describing, 109; moments, 528; price 
ratios, 179; references, 96, 146, 168; 
symbols employed, 582 

Frequency polygons, 77, 78, 79; 
change in commodity prices, 104, 
180, 183, 186; changes plotted on 
logarithmic scale, 189; distribution 
of exchange rates, 105 

Frequency series, 61, 92 

Frequency tables, 64; construction, 67 

Fulkner, H. D., 323 

Functional relationship, 15 


Galton, Francis, 394 

Gantt, H. L., progress chart, 48 

Gauss, K. F., 576 

“General business” curve of the American
Telephone and Telegraph Co.,
356

Generalization, 550 

Geometric averages of relative prices, 
198 

Geometric mean, 112; averaging ratios 
by, 188; characteristics, 136, 145; 
computing, 135; as a measure of 
central tendency, 139 

Geometric measures, arithmetic and 
harmonic compared with, 475 

Geometric series, 23 

Geometry, coördinate, 11

Gibson, Thomas, food price index, 241 

Gompertz curve, 297

Graphic presentation, 11, 36; frequency 
distributions, 74; references on 
methods, 59; standard rules, 51 

Graphs, 36 

Grouping process, 65 

Growth, curve of, 297 


Hall, L. W., 324 

Harmonic averages of relative prices, 
199 

Harmonic mean, 112; characteristic 
features, 145; method of employing, 
141 




Harmonic measures, arithmetic and 
geometric compared with, 475 

Height, human, 86, 87; soldiers’, dis- 
tribution, 98 

Histogram, 76 

Homogeneity of data, 129 

Human heights, 86, 87, 98 

Hyperbola, 22 

Hyperbolic curve, 20 


“Ideal” index, 215, 217, 219, 346;
substitute formula for, 218 


Income distribution, 82; frequency
curve, 86

Index numbers, 169, 170; references,
251; symbols relating to, 193, 584;
types and forms, 173, 174

Index numbers of physical volume, 344; 
references, 362 

Index numbers of price and buying 
power of farm products, 246, 247 

Index numbers of retail prices, 244 

Index numbers of the cost of living, 245 

Index numbers of wages, 248 

Index numbers of wholesale prices, 169; 
commodities — number and kind to 
be used, 222, 223; comparison of 
simple types, 201, 202; differences 
exhibited by various systems, 242; 
fundamental types (Fisher’s), 193; 
“ideal” index, 215, 217, 218 (formula),
219, 346; number of commodities
to be included, 222; price
groups in the field, 226; purpose, 
177; reliability of different types, 
220; simple aggregative type, 193; 
summary of characteristics and uses 
of different systems, 242; testing, 
203, 212; time reversal test, 203, 204; 
various American systems described, 
229; various methods employed in 
construction, 191; various problems 
involved, 221; weighted, comparison 
of farm crop prices, 216, 217; weight- 
ing, 205. See also American index 
numbers, etc. 

Index of correlation, 436; based on log- 
arithmic values, 462; derivation of 
the formula for, 570; meaning, 437; 
short method of computing, 438; 
standard error and, in terms of re- 
ciprocals, 468 


Induction, 549; statistical, 548, 550 

International index numbers of prices, 
232 

Interpolation, 81, 93 

Irrigation, alfalfa yield and, 433, 435, 
436 


Joint Committee on Standards for 
Graphic Presentation, 51 


Kansas, 486, 510 

Karsten, K. G., on selection of curves, 
305 

Kelley, T. L., on the best type of index 
number, 220 

Keynes, J. M., 552 

Killough, H. B., 455, 456 

Knibbs, G. H., 87 

Kurtosis, 110, 147, 168, 545 

Kurtz, Edwin, 88 


Labor, “prices” of, 248 

Least squares, method of, 273, 274; 
application to certain statistical 
problems, 562; for linear correlation, 
375, 378, 381, 401; references, 581 

Lettering of diagrams, 59

Life tables, 90 

Linear correlation, 363; details of 
calculation, 375; least squares 
method of measurement, 375, 378, 
381, 401; limitations of measures of 
relationship, 405; product-moment 
method of measurement, 385; refer- 
ences, 409 

Linear magnitudes, 52 

Linear relationship, 18 

Lines of regression, 393, 394, 475

Link relatives, 318, 319; advantages, 
328 

Lipka, Joseph, 301, 303 

Logarithmic charts, 32 

Logarithmic coördinates on diagrams,
55 

Logarithmic equations, 30 

Logarithmic mean, 138 

Logarithms, 26, 296; use in curve
fitting, 290

London-New York Exchange rates,
105, 106, 149, 151

London-Paris exchange rates, 156, 164,
165


Macaulay, F. R., 352 

Mantissa, 27 

Manufactured goods, 227 

Markets, 5 

Mathematical concepts, 11; references,
59

Mathematical curves, representation
of secular trend by, 271

Mathematical methods, 9, 11 

Mean. See Arithmetic mean; Geomet- 
ric mean; Harmonic mean 

Mean deviation, 149, 161, 163; computation,
150, 151

Measurement, 8; methods, 9 

Measurement of relationship, 363.
See also Relationship

Measures, arithmetic, geometric, and
harmonic, 474, 475

Measures of central tendency, 109, 111

Measures of reliability, 553; limitations,
559

Measures of skewness, 110, 166, 546

Measures of variation, 109, 147. See
also Variation

Median, 112; characteristic features,
144; graphic location, 131, 132;
location, 119, 121; of relative prices,
197

Mellor, J. W., 99

Merriman, Mansfield, 101

Mid-points, 69, 70, 71

Miner, J. R., 508

Mitchell, W. C., 104, 228, 231, 242;
comparison of index numbers based
on varying numbers of quotations,
222; on index numbers of prices, 181,
184; production index, 344

Modal group, 111

Mode, 111; characteristic features,
144; determining modal value from
mean and median, 130; graphic
location, 131, 132; location, 124

Moment, 528, 529

Monthly normal values, 329

Monthly values of production as basis
for index numbers, 351

Mortality tables, 90, 522

Motor vehicle registration, relation to
taxable personal incomes, 364, 365,
366, 369

Moving averages, 260; application,
263; application to series with linear
trend, 264; application to non-linear
series, 266, 267; characteristics, 262;
measurement of seasonal fluctuations
by, 321; New York bank
clearings, 261, 268; use in correlating
cycles in time series, 426; weighted,
270

Multiple and partial correlation, 485; 
limitations, 514; references, 515 

Multiple correlation, application of the 
method, 499; coefficients of, 496; 
method valid for linear relation- 
ships, 499 


National Industrial Conference Board, 
index numbers of the cost of living, 
246 

Nature, Uniformity in, 550

Net correlation, 500. See also Partial 
correlation 

New York (State) Census of Manu- 
factures, 406 

New York State Department of Labor, 
index of weekly earnings, 249 

Nitrogen as fertilizer, and wheat yield, 
442, 443, 444 

Non-linear correlation, 432; references, 
452 

Non-linear relationship, 20 

Normal curve of error, 107, 108, 160, 
516, 531; area in terms of abscissa, 
538; economic application, 527; fit- 
ting, 532; illustration of the measure- 
ment of areas under, 539; ordinates, 
532, 533; references, 546 

Normal equations, 276; checks on the 
formation of, 573; derivation, 563; 
formation, 564; solution, 576; solution
by the Doolittle method, 577,
578; standard set, 566 

Normal values, monthly, 329, 330, 332 

North Dakota, wheat yield, 114, 118 

Numerical data on diagrams, 57, 58 


Oats, production and price, 454, 455, 
457, 467, 476, 477, 479, 480, 481, 483 

Observations, 62; accuracy, 70; errors, 
99; probable error, 160. See also 
Probable error 

Ogburn, W. F., 269 

Ogive, 91, 92; frequency curve and, 
relation, 93, 94 


Ordinate, 12 

Ordinates of the normal curve, 532, 
533 

Organization of data, 61 


Parabola, 21; second degree, third 
degree, 25, 280, 281, 283 

Parabolic curve, 20 

Pareto, Vilfredo, law of, 140 

Partial correlation, 500; computation 
of coefficient, 506, 507; distinguished 
from simple correlation, 502; 
method, 502 

Peake, E. G., 105 

Pearl, Raymond, 296, 297, 566 

Pearson, Karl, 165, 166, 532, 537, 550; 
Chi-square test, 543; coefficient of 
correlation, 373; correlation ratio, 
441, 450; formula for skewness, 166; 
frequency curves, 545, 546 

Percentages on diagrams, 54 

Percentiles, 124 

Periodic fluctuations, 257 

Periodic functions, 25 

Perrin, Emily, 305 

Personal incomes, 81, 121; distribution 
among recipients, 82; distribution of 
recipients, 83, 84, 85, 86. See also 
Taxable personal incomes 

Persons, W. M., 192, 216, 218, 421; 
on advantages of method of link 
relatives, 328; commodity price 
index of business cycles, 238, 243; 
on the constitution of Bradstreet’s 
index, 235; on correlation of time 
series, 419; index of trade for the 
United States, 361; index of whole- 
sale prices, 223; measurement of 
seasonal fluctuations, 318; trend 
representation by related series, 
306 

Petroleum. See Coal and petroleum 
production 

Petroleum production, 291; line of 
trend fitted to logarithms, 292, 293; 
fitted to natural numbers, 294, 295 

Physical volume, index numbers, 344 

Plotting paper, 33 

Point, location of, 12 

Population, curve of growth, 297 

Potato prices, 475 

Potential series, 25, 280 


Preferred stock prices, 136 

Prescott, R. B., 297 

Price changes, 175; averaging, 187; 
wholesale commodity prices, 175 

Price groups, 226 

Price ratios, 179. See also Relative
prices

Price system, 5, 8; elements, graphic
representation, 224

Prices, 4; aggregates of actual, 193;
Bradstreet’s index numbers of wholesale,
173; harmonic mean in handling,
142; importance, 5; index
numbers of (see Index numbers of
prices); post-war trend, 306; variation,
148; weighted aggregates of
actual, 207, 208. See also Relative
prices

Probabilities, 9; a priori and empirical,
522; addition, 518; elementary
theorems, 516; examples of simple,
517; measurement, 520; multiplication,
518

Probability curve, 107

Probable error, 160, 162, 164, 555, 556,
559

Product-moment method of linear correlation,
385, 390, 402

Production, 3, 4; absence of statistics,
344; actual and scheduled compared,
47, 48, 49, 50; costs, 46, 47; index
numbers based on monthly values,
351; index numbers of E. E. Day,
347, 348, 349, 350, 351; index of
War Industries Board, 345, 346;
steel, 39

Products of factories in New York
(State); number of wage earners as
related to, value of, 406, 407

Profits, 5

Progressive mean, 270

Projections, 304


Quantitative data, 6, 7; classification, 
97 

Quantitative methods, 6, 8 

Quartile deviation, 158, 161, 163 

Quartiles, 124; graphic location, 131, 
133, 134; symbols, 583 


Random selection, 552

Range, 149, 161, 162


Rates, 135 

Ratio chart, 33; advantages, 38; 
advantages, summary, 44 

Ratios, 135, 137 

Raw materials, 227; giving weight to 
various classes, 228 

Reciprocals, for standard error and 
index of correlation, 468; in measure- 
ment of relationship, 465 

Rectangular coördinates, 11; location
of a point with reference to, 12 

Reed, L. J., 297 

References, list, 589 

Regression equation, 394, 454; use,
399

Regression, lines of, 393, 394

Related series, 306

Relationship, measurement of, 363;
between time series, 410; comparison,
derived from arithmetic, geometric,
and harmonic measures, 471;
comparison of measures, 498; computation
of measures of, 381; curves
that may be used, 472; factors
governing the choice of measures,
473; linear correlation, 363; multiple
and partial correlation, 485;
non-linear correlation, 433; problem
of estimation and, 453; reciprocals,
465; symbols used, 465, 584

Relative prices, 179; arithmetic averages
in construction of index numbers,
195, 196; distribution of 222 commodities
in 1900, 182, 183; distribution
of 346 commodities in 1914,
179, 180; distribution of 1437
commodities in 1918, 185, 186; distribution
of 1437 commodities in
1918 plotted on logarithmic scale,
189; geometric averages, 198;
harmonic averages, 199; medians
for index numbers, 197;
weighted arithmetic averages, 210;
weighted geometric averages, 211,
212

Reliability, measures of, 553; limitations
to measures of, 559

Research, 8

Residual, 274

Retail prices. See Index numbers of
retail prices

Rietz, H. L., 153

Robertson, R. D., 433 

Root-mean-square deviation, 154, 303, 
368 

Running, T. R., 301, 303 


Sales, chart, 40, 41, 42; records, 7 

Samples, 550 

Sampling, 548; necessity of a repre- 
sentative sample, 552; references, 
561; simple, conditions, 552; stan- 
dard error of, 540, 541 

Sawmills, 94, 95; earnings of band 
sawyers, 120 

Scale for curves, reading, 56 

Scales of a diagram, 57 

Scatter, 109, 147, 372 

Scatter diagrams, 364, 366, 404 

Seasonal variations, 257; comparison
of indices, 328; computing seasonal
index numbers by averaging ratios
to trend, 323; determining monthly
trend values, 329, 330, 332; measurement,
315; method of link relatives,
318, 319; references, 343; use of
moving averages, 321

Second degree parabola, 280, 281, 283; 
alfalfa and irrigation and, 434; busi- 
ness failures and, 286 

Second order coefficients of correlation, 
507; computation, 511 

Secular trend, 256, 257; comparison
of lines of trend, 288; measurement,
259; references, 313; relationship
between trends, 411; representation
by mathematical curves, 271; representation
by related series, 306;
selecting appropriate type, 300

Selected points, method of, 299 

Semi-logarithmic charts, 32, 39. See
also Ratio chart 

Semi-logarithmic curves, 290 

Series, 18; arithmetic, 18; continuous 
and discrete, 85; frequency, 61, 92; 
geometric, 23; potential, 25, 280; 
time, 37, 61. See also Time series 

Sheppard, W. F., 158, 530, 537 

Short term fluctuations, correlation, 427 

Shots, distribution, 100, 101, 102 

Simplification, 8 

Sine curve, 20, 25, 26 

Skewness, 110, 147; measures of, 166, 
546; symbols, 583 


Slide rule, 30 

Smoothing of curves, 80, 87, 127 

Snyder, Carl, 256 

Soldiers, distribution by heights, 98 

Spearman, Charles, 436 

Speedwell Automobile Co., 47, 48 

Standard deviation, 154, 162, 163, 368, 
405, 513; computation, 154, 156 

Standard error of estimate, 368, 375; 
application, 461; computation, 368, 
372, 495; computation formula, 376; 
in logarithmic terms, 457; in terms 
of ratios, 460; index of correlation 
and, in terms of reciprocals, 468; in- 
terpretation, 459, 469; derivation of 
the formula for, 567 

Standard error of sampling, 540, 541 

Standard errors of chief statistical 
measures, 555 

Statistical description, 548 

Statistical induction, 548; references, 561

Statistical methods, 3; definition, 6; 
external problems and, 8; internal 
administration problems and, 6; 
limitations, 9 

Statistical results, 549; generalization, 
550 

Statistical tables, 73. See also Tables 

Steel production, 39 

Steinmetz, C. P., 301; on empirical 
curves, 283, 284 

Stewart, Ethelbert, 95 

Stock fluctuations, 149 

Stock price cycles as related to business 
cycles, 421, 423, 424, 425, 426 

Straight line, 16; business failures and, 
286; fitting, 273, 278 

Symbols, glossary of, 582 

Symmetry, 147 

Tables, cumulative, 90; statistical, 
structure, 73 

Taxable personal incomes in relation 
to motor vehicle registration, 364, 
365, 366, 369 

Telephone messages, 527, 528, 535, 
536, 543 

Telephone poles, 88, 89, 90, 91, 92, 
130 

Temperature and corn yield, 485, 486 

Testing goodness of fit, 535, 540; 
Chi-square test, 543 


Third degree parabola, 280, 282, 284; 
business failures and, 287 

Thomas, Dorothy, 269 

Time element, 256 

Time rates, 141 

Time reversal test, 203, 204, 212 

Time series, 37, 61; analysis: measurement
of seasonal and cyclical
fluctuations, 315; analysis: measurement
of trend, 252; data and preliminary
organization, 253; difficulties
in the correlation, 419; fluctuations,
periodic and random, 257,
258; forces affecting, 255; general
considerations relating to analysis,
342; graphic representation, 254;
measurement of relationship, 410;
references on correlation of, 431

Titles of diagrams, 59 

Tolley, H. R., 491 

Trade, index of volume, by Federal 
Reserve Bank of New York, 359, 361 

Trend, 256; concept, 304, 306; measure- 
ment, 252; references, 313. See 
also Secular trend 


Uniformity of nature, 550 

U.S. Bureau of Labor Statistics, 207, 
222, 253; index numbers of the cost 
of living, 246; index of retail food 
prices, 244; index of wholesale prices, 
229, 243 

U.S. Department of Agriculture, 246, 
248; index numbers, 246, 247 

U.S. Steel Corporation, unfilled ton- 
nage, 254 


Value, 8 

Variability, measures of, 512 

Variables, 14; correlation of several, 
485; independent and dependent, 
14; symbols for, 14, 582 

Variation, 106; absolute and relative, 
148; characteristic features of the 
chief measures, 162; coefficient of, 
165; measure of, 109; measures of 
absolute, 148, 149; measurement of 
relative, 164; nature and significance, 
147; references, 168; relations be- 
tween different measures, 161; sym- 
bols, for measures of, 583 

Visualization, 36 


Volume of trade, Federal Reserve Bank 
index, 359, 361 


Wage-earners, cumulative distribution, 
133; number as related to value of 
their products, 406, 407 

Wages, index numbers of money wages 
and real, 248; index of real, 250 

Walsh, C. M., 140 

War Industries Board, index numbers of 
wholesale prices, 233, 243; produc- 
tion index, 345, 346 

Weekly earnings, 248; as basis of 
frequency distribution, 65, 72, 74; 
distribution, 132, 133, 134 

Weighted aggregates of actual prices, 
207, 208 

Weighted arithmetic averages of rela- 
tive prices, 210 

Weighted arithmetic mean, 115 

Weighted average, 115 

Weighted geometric averages of rela- 
tive prices, 211, 212 


Weighting of index numbers, Fisher’s 
four methods, 209 

Weld, L. D., 527 

Weldon, W. F. R., 523 

Wheat flour exports, 38 

Wheat, nitrogen fertilizer and yield, 
442, 443, 444; No. 1 Northern 
Spring, relative price, index numbers, 
170; yield in North Dakota, 114, 118 

Whipple, G. C., 98 

Wholesale prices, 306. See also Index 
numbers of wholesale prices 

Working, Holbrook, 475 


Yule, G. U., 71, 155, 446, 523, 552 


Zero line, 52, 53 

Zero order coefficient of correlation, 
507 

Zone of dispersion in artillery firing, 
100, 101, 102 

Zones of estimate, 384; significance, 
477 




HA 40 .E3M5 1930
Mills, Frederick Cecil
20904
Statistical Methods Applied to Economics and Business.




